I need to match about 25 million address records against about 200,000 other address records. I'd also like a small degree of fuzziness, so exact-match comparison is pretty much out. The addresses in both data sets are parsed into components, and both sets are stored in a SQL Server 2008 database.
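To give an idea of the kind of fuzziness I mean: I'd consider two components a match if they're within a small edit distance of each other. A minimal Levenshtein sketch in C# (the threshold of 2 is just a guess on my part):

```csharp
using System;

static class Fuzzy
{
    // Classic dynamic-programming Levenshtein edit distance.
    // Two components "match" if the distance is at or below a small threshold.
    public static int Levenshtein(string a, string b)
    {
        var d = new int[a.Length + 1, b.Length + 1];
        for (int i = 0; i <= a.Length; i++) d[i, 0] = i;
        for (int j = 0; j <= b.Length; j++) d[0, j] = j;

        for (int i = 1; i <= a.Length; i++)
            for (int j = 1; j <= b.Length; j++)
            {
                int cost = a[i - 1] == b[j - 1] ? 0 : 1;
                d[i, j] = Math.Min(Math.Min(d[i - 1, j] + 1,   // deletion
                                            d[i, j - 1] + 1),  // insertion
                                   d[i - 1, j - 1] + cost);    // substitution
            }
        return d[a.Length, b.Length];
    }

    public static bool IsMatch(string a, string b)
    {
        return Levenshtein(a.ToLowerInvariant(), b.ToLowerInvariant()) <= 2;
    }
}
```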
My first idea was to do the comparisons in batches (grouping the batches by state) until I reached the end, dumping matches into a temporary database. This would be done in a .NET application, but I don't think it would be very efficient, since I'd have to pull the data out of SQL into the application and iterate over the records one by one. Threading could speed things up, but I don't know by how much.
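Here's roughly what that batch approach would look like. Everything here is hypothetical (the `BigAddresses`/`SmallAddresses` tables, the column names, the connection string), and the match test is a placeholder where the edit-distance check sketched above would go:

```csharp
using System;
using System.Collections.Generic;
using System.Data.SqlClient;
using System.Linq;
using System.Threading.Tasks;

class BatchMatcher
{
    // Hypothetical schema: BigAddresses (~25M rows) and SmallAddresses (~200K rows),
    // each with the parsed State, Street, and Zip components as columns.
    static IEnumerable<string[]> LoadState(string connStr, string table, string state)
    {
        using (var conn = new SqlConnection(connStr))
        using (var cmd = new SqlCommand(
            "SELECT Street, Zip FROM " + table + " WHERE State = @state", conn))
        {
            cmd.Parameters.AddWithValue("@state", state);
            conn.Open();
            using (var reader = cmd.ExecuteReader())
                while (reader.Read())
                    yield return new[] { reader.GetString(0), reader.GetString(1) };
        }
    }

    static void MatchState(string connStr, string state)
    {
        // Pull the small side for this state into memory once,
        // then stream the big side past it row by row.
        var candidates = LoadState(connStr, "SmallAddresses", state).ToList();
        foreach (var row in LoadState(connStr, "BigAddresses", state))
            foreach (var cand in candidates)
                // Placeholder test: same zip, same street ignoring case.
                // A real run would swap in the fuzzy comparison from above.
                if (row[1] == cand[1] &&
                    string.Equals(row[0], cand[0], StringComparison.OrdinalIgnoreCase))
                    SaveMatch(state, row, cand);
    }

    static void SaveMatch(string state, string[] big, string[] small)
    {
        // The real version would INSERT into the temporary match table;
        // console output just keeps the sketch self-contained.
        Console.WriteLine("{0}: '{1}' matches '{2}'", state, big[0], small[0]);
    }

    static void Main()
    {
        const string connStr = "Server=.;Database=Addresses;Integrated Security=true";
        string[] states = { "AL", "AK", /* ...the rest... */ "WY" };
        // One state per worker; whether this actually helps depends on
        // whether SQL Server I/O or the comparisons are the bottleneck.
        Parallel.ForEach(states, state => MatchState(connStr, state));
    }
}
```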
I also thought about indexing the 25 million records into a Lucene index and using its filtering to narrow down the potential matches.
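I haven't prototyped this, but here's a rough sketch of what I'm picturing with Lucene.NET (assuming the 3.x API; the field names and the 0.7 similarity are just guesses):

```csharp
using System.IO;
using Lucene.Net.Analysis.Standard;
using Lucene.Net.Documents;
using Lucene.Net.Index;
using Lucene.Net.Search;
using Lucene.Net.Store;
using Version = Lucene.Net.Util.Version;

class LuceneSketch
{
    static void Main()
    {
        var dir = FSDirectory.Open(new DirectoryInfo("addr-index"));
        var analyzer = new StandardAnalyzer(Version.LUCENE_30);

        // Index the 25M records once, one document per address
        // (in reality this loop would read from a SqlDataReader).
        using (var writer = new IndexWriter(dir, analyzer, true,
                                            IndexWriter.MaxFieldLength.UNLIMITED))
        {
            var doc = new Document();
            doc.Add(new Field("state", "TX", Field.Store.YES, Field.Index.NOT_ANALYZED));
            doc.Add(new Field("street", "100 main st", Field.Store.YES, Field.Index.ANALYZED));
            writer.AddDocument(doc);
        }

        // For each of the 200K records: filter hard on state, fuzzy-match the
        // street, and treat the top hits as candidates to verify more carefully.
        using (var searcher = new IndexSearcher(dir, true))
        {
            var query = new BooleanQuery();
            query.Add(new TermQuery(new Term("state", "TX")), Occur.MUST);
            query.Add(new FuzzyQuery(new Term("street", "main"), 0.7f), Occur.MUST);
            var hits = searcher.Search(query, 10);
            // hits.ScoreDocs now holds the candidate matches.
        }
    }
}
```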
Is either of these a good approach? What other options are there?