1

I need match about 25 million address records with about 200,000 other address records. I also would like to have a small degree of fuzziness so comparing for exact matches is kind of out. The addresses are parsed into components in both data sets. And they are both stored in a SQL Server 2008 database.

I had an idea to do comparisons in batches (grouping the batches by state) until I reached the end, dumping matches into a temporary database. This would be done in a .NET application, but I don't think this is too efficient since I have to pull the data from SQL into the application and iterate over it one by one. Threading could speed up the process, but I don't know by how much.

I also thought about indexing the 25 million records into a Lucene index and utilizing the filtering in that to trim narrow down potential matches.

Are either of these a good approach? What other options are there?

Cuthbert
  • 2,908
  • 5
  • 33
  • 60

1 Answers1

0

For a first pass do an exact

For fuzzy you can use Levenstien Distance

Levenstien Distance TSQL

You can also run Levenstien in .NET
It might make sense to bring the 200,000 into a .NET collection
And then compare the 25 million one at a time to the 200,000
I assume the .NET implementation is faster just don't know how much faster.

C# Levenshtein

MSSQL has SOUNDEX but it is way to fuzzy.

Hopefully you have a valid state to filter on

Hopefully you have a some valid zip code
If the zip code is valid then filter to only that zip code

Community
  • 1
  • 1
paparazzo
  • 44,497
  • 23
  • 105
  • 176