7

I'm working on a survey program where people will be given promotional considerations the first time they fill out a survey. In a lot of scenarios, the only way we can stop people from cheating the system and getting a promotion they don't deserve is to check street address strings against each other.

I was looking at using levenshtein distance to give me a number to measure similarity, and consider those below a certain threshold a duplicate.

However, if someone were looking to game the system, they could easily write "S 5th St" instead of "South Fifth Street", and levenshtein would consider those strings to be very different. So then I was thinking to convert all strings to a 'standard address form' i.e. 'South' becomes 's', 'Fifth' becomes '5th', etc.

Then I was thinking this is hopeless, and too much effort to get it working robustly. Is it?

I'm working with PHP/MySql, so I have the limitations inherent in that system.

Juha Syrjälä
  • 33,425
  • 31
  • 131
  • 183
user151841
  • 17,377
  • 29
  • 109
  • 171
  • 1
    What if instead of "S. 5th St." someone enters "S. 4th St."? This couldn't be used to game the system (assuming you're mailing the promotional stuff), but it could disqualify people for living one block over. Just an edge case to test. – Bill the Lizard May 20 '10 at 16:27
  • @Bill that scenario is not a problem because then they wouldn't recieve their promotional consideration. Unless they're in cahoots with the folks who reside on that house address on 4th street, but there's only so many households they can conspire with. It's self-limiting, I think :) – user151841 May 20 '10 at 16:39
  • @user15841: No, I mean what if those two people legitimately sign up independently of each other? Your algorithm needs to be smart enough to see the difference between those two addresses, but also smart enough that it sees the original examples you gave as the same. – Bill the Lizard May 20 '10 at 16:49
  • You mean, if someone accidentally gives someone else's address? Yeah, that's a problem, but I don't see how the system could address it without being open to more gaming ("Are you sure you meant 4th street? We already have one for that address. Care to try again?" ) – user151841 May 20 '10 at 16:51
  • No, I meant if two people living at very similar but different addresses both sign up, one of them might not get their prize. – Bill the Lizard May 20 '10 at 16:55
  • Is street address only identifier? How about email ID / cookies / IP based detection? – Jack May 20 '10 at 16:59
  • @Jack they could be using any computer anywhere - at home, at the library, at grandma's house. – user151841 May 20 '10 at 17:04
  • @Bill well, I think the street name has to be one of the major identifiers. If the algorithm is going to confuse Fourth Street with Fifth Street, that obviously won't work. No different from confusing Main Street with Town Street :) – user151841 May 20 '10 at 17:05

3 Answers3

4

I think your second idea is better than using Levenshtein distance. If you try to compare the addresses for similarity, then two different people who live nearby each other might accidentally "cheat" one another out of their prize. If I live at "S. 4th St." but my neighbor at "S. 5th St." already signed up, those two addresses might seem too similar by Lev distance.

You could reduce (but probably not eliminate) a lot of potential cheating by running addresses through a synonym normalizer. Before you check for equality, just convert

North -> N.
East -> E.
...
First -> 1st
Second -> 2nd
Third -> 3rd
...
Street -> St.
Avenue -> Ave.

The longer the list of synonyms you come up with, the better it will be at catching matches. It will be a little bit slower at processing, but addresses are tiny.

This is similar to converting strings to all lower (or upper) case before comparing them. (Which I also recommend, naturally.)

Bill the Lizard
  • 398,270
  • 210
  • 566
  • 880
  • Oh, finally I understand what you're saying! I haven't used levenshtein, so I wasn't familiar enough with it to forsee how that situation would arise :) – user151841 May 21 '10 at 13:11
  • Additionally, it must be noted that multiple residents may be living in the same building... That's where it gets tricky, even after standardization. For example "511 N 15th St, Unit 123" vs "511 NORTH 15th St, Apt 124" – A Sharma Jul 07 '14 at 18:54
  • 1
    You should consider using string distance comparison on the synonyms too. Otherwise "South" will become "S", but "Soith" (typo) will not, and "Soith"->"S" is not similar. Also, be mindful that there are thousands of unicode characters that will produce characters that look like a-z characters but aren't. Furthermore, "Street"->"St." can lead to a false positive for the 'Saint' abbreviation "St.". – Richard EB Mar 05 '16 at 23:30
  • Furthermore, it significantly helps to extract the number parts and the text parts, and weight similar numbers higher than dissimilar text tokens. (In this situation, I'd consider the Jaro Winkler edit distance over the Levenshtein distance for comparing the text tokens, since 'Ave'-'Avenue' is within a sensible threshold, whereas it wouldn't be for lev. distance.) – Richard EB Mar 05 '16 at 23:46
0

You can use Google Map API (or any other mapping API) to normalize addresses as geo-location (lat/long).

Franci Penov
  • 74,861
  • 18
  • 132
  • 169
  • 1
    Wouldn't work because the majority of these geospatial api's fail to consider the apartment numbers (e.g., 12th floor 6th room). – code4life May 20 '10 at 16:33
  • Yes, such normalization wouldn't be 100% accurate and you would need to do additional checks as well. – Franci Penov May 20 '10 at 16:55
0

See these questions for related discussion.

  • Normalize your data first as much as possible:

    avenue -> ave road -> rd Rd. -> rd

    first -> 1 1st -> 1

You could look into SOUNDEX or something similar to catch cases where words sound same but have different spelling (e.g. Schmitt, Schmitd, Smith). SOUNDEX works on word level, so you'll need to split address to words first, and compare the SOUNDEX values.


You could also feed the addresses to some geo location service such as Google Maps, store resulting longitude and latitude to your database. When a new address is entered, you just get its longitude/latitude and compare against existing locations in your database. See this question for details.

Community
  • 1
  • 1
Juha Syrjälä
  • 33,425
  • 31
  • 131
  • 183