Why is address data so hard to get right?

Address data is everywhere.  Approximately 80% of databases and 25% of business documents contain address data[1].  What’s more, a lot of this data is lacking in quality.  Gartner has estimated that more than 25% of critical data in large corporations is flawed[2].  Data quality in general is an issue for most organizations, but address data quality seems to be a particular challenge.

What makes address data harder to get right?

We like to say that one of the problems with addresses is that everybody’s got one.  There are many different ideas of how an address should be represented and stored.  It would be nice if there were such a thing as “the right answer” for any particular address, but it isn’t that easy.  There are several reasons why addresses are hard to get right:

  • Options for layout
  • Options for standardization
  • Alternate names
  • Rural mailing addresses versus street addresses

Options for Layout

Address layout refers to the structure of an address.  There is no clear standard for this.  Canada Post publishes some guidelines for structuring an address on an envelope, but even within the guidelines there is room for a lot of variation.  For example, how many address lines should there be?  When you capture or store an address, should city, province and postal code be on the same line or in separate fields?  Having a wide variety of structures means that different people will interpret (or misinterpret) your address layout differently.  If you have to deal with address data from a variety of sources, you’ve no doubt had to cope with many different ideas of how addresses should be structured.
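To make the layout question concrete, here is a minimal sketch of splitting a combined "city, province, postal code" line into separate fields. The function name, the field names, and the assumption that the province appears as a two-letter code are illustrative choices, not part of any Canada Post guideline; real data will contain many lines that this pattern does not match.

```python
import re

# Assumed layout: "<city>[,] <2-letter province> <postal code>".
# Postal codes follow the letter-digit-letter digit-letter-digit pattern,
# with or without the middle space.
LAST_LINE = re.compile(
    r"^(?P<city>.+?),?\s+(?P<province>[A-Z]{2})\s+"
    r"(?P<postal>[A-Z]\d[A-Z]\s?\d[A-Z]\d)$"
)

def split_last_line(line):
    """Return city, province and postal code as a dict, or None if no match."""
    collapsed = " ".join(line.upper().split())  # upper-case, collapse spaces
    match = LAST_LINE.match(collapsed)
    if match is None:
        return None
    postal = match.group("postal").replace(" ", "")
    return {
        "city": match.group("city"),
        "province": match.group("province"),
        "postal_code": postal[:3] + " " + postal[3:],
    }

# Two different layouts of the same last line yield the same fields.
print(split_last_line("Toronto, ON M2R2T1"))
print(split_last_line("TORONTO  ON  M2R 2T1"))
```

Storing the three values separately avoids having to guess at the layout later, but it only helps if every source is parsed into the same fields on the way in.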

Options for Standardization

Even once you’ve settled on the overall structure of your address, there are many variations on the details of the individual address elements themselves.  For example, do you put apartment numbers in front of the street number with a dash delimiter or after the street address using a keyword like “APT”?  Do you leave punctuation in or strip it out?  Does your address data contain accented characters?  Is it all upper case or do you use mixed case?  What about keywords: are they spelled out or abbreviated?  If you use abbreviations, are they applied consistently?  Variations in standardization can make two identical delivery points look very different, to the point where standard database utilities and text-matching tools cannot detect that two addresses refer to the same delivery point.
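As a rough illustration of why plain text matching fails, the sketch below normalizes case, accents, punctuation and a handful of keywords before comparing. The abbreviation table and the sample address are made up for the example, and this is not how any particular address tool works; it only shows the kind of variation that has to be removed before two strings can be compared.

```python
import re
import unicodedata

# Hypothetical abbreviation table; a real one would be far larger.
ABBREVIATIONS = {
    "STREET": "ST",
    "AVENUE": "AVE",
    "WEST": "W",
    "EAST": "E",
    "APARTMENT": "APT",
}

def normalize(line):
    """Upper-case, strip accents and punctuation, and abbreviate keywords."""
    # Decompose accented characters, then drop the combining marks.
    decomposed = unicodedata.normalize("NFKD", line)
    unaccented = "".join(c for c in decomposed if not unicodedata.combining(c))
    # Replace punctuation with spaces; split() then collapses the whitespace.
    cleaned = re.sub(r"[^\w\s]", " ", unaccented.upper())
    return " ".join(ABBREVIATIONS.get(token, token) for token in cleaned.split())

a = "123 Main Street West, Apt. 4"   # made-up address
b = "123 MAIN ST W APT 4"
print(a == b)                        # False: raw strings differ
print(normalize(a) == normalize(b))  # True: same text once normalized
```

Note that normalization alone does not reconcile the dash-prefix form ("4-123 MAIN ST W") with the keyword form ("123 MAIN ST W APT 4"). That requires parsing the address into elements, which is a harder problem.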

Alternate Names

Many municipalities and streets have multiple names.  Is your address in Toronto, North York or Willowdale?  The answer, unfortunately, may be “all of the above”.  Are you on Queen Street or Highway 7?  In Brampton the answer could be “yes”.  Handling alternative names, and even variations in abbreviation and punctuation, can be a real challenge.  For example, which is right: SAINT-BERNARD, ST-BERNARD, or ST BERNARD?
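One common way to cope with alternate names is an alias table that maps every known variant to a single preferred form. The sketch below is only an illustration: the table contents and the choice of which name counts as "preferred" are arbitrary assumptions for the example, not Canada Post's list of accepted alternatives.

```python
# Hypothetical alias table: each alternate name maps to one preferred name.
# Which name is "preferred" is an arbitrary choice made for this example.
MUNICIPALITY_ALIASES = {
    "NORTH YORK": "TORONTO",
    "WILLOWDALE": "TORONTO",
    "ST-BERNARD": "SAINT-BERNARD",
    "ST BERNARD": "SAINT-BERNARD",
}

def canonical_municipality(name):
    """Return the preferred name for a municipality, if an alias is known."""
    key = " ".join(name.upper().split())  # upper-case, collapse whitespace
    return MUNICIPALITY_ALIASES.get(key, key)

print(canonical_municipality("Willowdale"))
print(canonical_municipality("st bernard"))
```

A lookup like this makes records comparable, but it also discards information: a mailer who deliberately wrote "Willowdale" may not want it replaced, so the original value is usually worth keeping alongside the canonical one.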

Surely There’s a “Right” Answer…

Canada Post publishes a list of the official mailing addresses in Canada.  Shouldn’t this list make it easy to get each address exactly right?  It would be nice if that were true, but unfortunately it isn’t.  It is true that Canada Post licenses data files that define the official Canadian mailing addresses and they publish guidelines for address layouts and rules for standardizing addresses, all of which is helpful.  However, in some ways, Canada Post actually makes matters a little worse.

At the end of the day, Canada Post is not a data quality company; it is a logistics company.  Its primary concern is not making sure that addresses are right so much as making sure that they are deliverable.  As a federal Crown corporation, it also has some political motivations which can override purist concerns about accuracy.

Canada Post evaluates address accuracy software through its Software Evaluation and Recognition Program (SERP).  To pass the SERP tests, software needs to conform to the SERP rules.  Some of these rules have a bias towards customer choice rather than strict accuracy.  Some of the rules are also counter-intuitive from an accuracy perspective.  Some examples include:

  • When assessing the quality of an address match, every element has equal weight.  That means that province is as important as street type and postal code is as important as street direction.  At least that’s what Canada Post’s SERP rules say.
  • Some addresses are not allowed to be fixed.  Rural addresses and large volume receiver (LVR) addresses that have valid postal codes must be “declared valid” and other changes[3] must not be made.
  • Some kinds of fixes are forbidden, such as one which changes an address’s delivery mode.
  • Canada Post accepts many specific city name and street name alternatives.  Many other alternative names are not permissible.
  • For some street types, the mailer has the option of using the French translation of the street type (e.g. RUE) even if the address is in English (ST), and vice versa, they can use the English translation of the street type (AVE) even if the address is French (AV).
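To see what equal weighting implies, consider the simplified sketch below. It is an illustration of the idea only, not Canada Post's actual SERP scoring method; the element names and the sample values are assumptions made for the example.

```python
# Hypothetical set of address elements; each one counts the same.
ELEMENTS = (
    "street_number",
    "street_name",
    "street_type",
    "street_direction",
    "municipality",
    "province",
    "postal_code",
)

def equal_weight_score(candidate, reference):
    """Fraction of elements on which the candidate agrees with the reference."""
    matches = sum(1 for e in ELEMENTS if candidate.get(e) == reference.get(e))
    return matches / len(ELEMENTS)

reference = {
    "street_number": "123", "street_name": "MAIN", "street_type": "ST",
    "street_direction": "W", "municipality": "TORONTO",
    "province": "ON", "postal_code": "M2R 2T1",
}
wrong_province = dict(reference, province="QC")
wrong_street_type = dict(reference, street_type="AVE")

# Under equal weighting, both errors cost exactly the same.
print(equal_weight_score(wrong_province, reference)
      == equal_weight_score(wrong_street_type, reference))  # True
```

From a pure accuracy perspective, most people would treat a wrong province as a far more serious error than a wrong street type, which is why this rule feels counter-intuitive.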

All of these mean that there can be more than one right answer for a particular address, and that sometimes the right answer is not the acceptable answer (as far as Canada Post is concerned).  Consider the following example:

APT. 1004
1004-1085 STEELES W
ON M2R 2T1

These addresses are each nearly right in terms of being complete and accurate, but they don’t look very much alike.  You have to know a fair bit about Canada Post’s SERP rules to be able to tell that these two addresses are actually for the very same delivery point.  That’s where a tool like nCode comes in very handy.

Rural Mailing Addresses versus Street Addresses

Another thing that can make addresses harder to get right is that a small percentage of the population, but a large proportion of the country, uses rural addressing.  Instead of the typical street addresses, rural addresses use postal stations, PO boxes, rural routes and even general delivery.  People with rural mailing addresses also have street addresses, but the two are not the same.  Consider these addresses:

N0M 2T0
N0M 2T0
PO BOX 293
N0M 2T0

What’s the difference between these addresses?  One is Tele-Atlas’ view of the street address, one is Navteq’s view of the street address, and one is Canada Post’s view of the mailing address.  They all refer to the same household.  Which one is right?  It depends on whether you are looking to mail something or if you are looking for driving directions, and even then there seems to be a difference of opinion.  It just goes to show that with address data, there often isn’t a simple answer.

So is Bad Address Data Just a Fact of Life?

Absolutely not!  With the help of the powerful, SERP-recognized nCode line of address quality tools, you can make sure that the wide variety of address data problems you face are detected and corrected before they ever find their way into your database.  Furthermore, with nCode you can make sure that the quality of your address data doesn’t degrade over time.  You may not be able to do much about the problems in the address data you receive, but there is no need to let these problems into your address database.

[1] The Data Warehousing Institute, 2002: Special Report “Data Quality and the Bottom Line”
[2] Gartner press release, 2004.
[3] There are certain corrections, referred to by Canada Post as “Optional Corrections”, which SERP software is allowed to make to rural addresses.  These corrections are limited and don’t include many of the kinds of corrections that most people would expect or hope for.  nCode will make as many optional corrections to rural and LVR addresses as it can.
