I am using Access.
Scenario
At work I've got a table with around 300k rows that maps Person IDs to House IDs together with the associated information (First Name, Last Name, City, "Street + StreetNumber", Postal Code). Each person can live in n houses, and each house can be inhabited by n of those people.

When various persons visit me, I get a table. This table is filled in by a human being, so it contains no IDs and, unfortunately, often has spelling errors and missing information. It should contain "First Name", "Last Name", "Street & Nr", "City", "Postal Code".

To integrate the data I need the persons' IDs. To counter the spelling-error problem I want to build a table that gives me results ordered by "matching priority".

The hand-filled table is called tbl_to_fill and has an empty Person-ID column, an indexed AutoNumber, and First Name, Last Name, Street & Nr, City and Postal Code. The table with the relation information is called tbl_all.

So if I find an exact match (with a join query) from tbl_to_fill to tbl_all on "First Name", "Last Name" and "Postal Code", or on "First Name", "Last Name", "Street & Nr" and "City", it gets "matching priority" 1. If I only find an exact match on "Last Name" and "Postal Code", or on "Last Name", "City" and "Street & Nr", it gets "matching priority" 2. And there are a few more levels.
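
The priority 1 query, for example, looks roughly like this (ID stands for the AutoNumber in tbl_to_fill and Person-ID for the key column in tbl_all):

SELECT f.ID, a.[Person-ID], 1 AS matching_priority
FROM tbl_to_fill AS f INNER JOIN tbl_all AS a
    ON (a.[First Name] = f.[First Name]
    AND a.[Last Name] = f.[Last Name]
    AND a.[Postal Code] = f.[Postal Code]);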

Then comes the tricky part:
Now I built a "tbl_filter" from "tbl_to_fill" with tweaked informations: The street numbers are cut, common spelling errors are replaced with a '*' (for example a common spelling error in german names: ph - f, like Stefan and Stephan), city names are shortened after the last space " " found and some more.

With this table I look for the same criteria as stated above, but with a "LIKE '*' & tbl_filter.Field & '*'" query, and those matches get the same matching priority as above plus 10. All the join queries and the LIKE queries are then aggregated via a union query; let's call it query "001 quni All rows".
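
One of the LIKE branches looks roughly like this (using the placeholder column names from the tbl_filter sketch above):

SELECT f.ID, a.[Person-ID], 11 AS matching_priority
FROM tbl_filter AS f, tbl_all AS a
WHERE a.[First Name] LIKE '*' & f.FirstNameF & '*'
    AND a.[Last Name] LIKE '*' & f.LastNameF & '*'
    AND a.[Postal Code] = f.[Postal Code];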

I got this to work exactly like I want to, but it takes AGES every time I run the last query.

My Question
Has someone done something similar? What can I do to speed up the process?

As many of my matching criteria expect First Name and Last Name to match plus some more fields, should I first extract only the matching rows from "tbl_all" via a make-table query and then run the other queries against that?
Should I use regex instead of LIKE queries on a field that contains all information concatenated with a "-"? Is there a better way to assign those priorities? Maybe all in one query via an IIf function?

SELECT ..., IIf(tbl_all.[First Name] = tbl_to_fill.[First Name], 1,
    IIf(...)
) AS matching_priority
FROM tbl_all;

I am a self-taught Access developer, so I often have problems knowing which approach is the most optimized. I use VBA regularly and don't shy away from it, so if you have a solution via VBA, let me know.

J Lee

2 Answers

I think you might simplify your approach a bit if you were to use fuzzy text search. The common way of doing this is to use the Levenshtein distance, or the number of changes it takes to turn one string into another. A nice implementation of the Levenshtein is here:

Levenshtein Distance in Excel

In this manner, you can find the closest possible match on city, street, first name, last name, etc. You can even set a "reasonable" limit, such as treating any record where the Levenshtein distance is greater than 10 as "unreasonable." I threw out 10, but it will vary depending on your data.
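
Assuming you drop that implementation into a standard VBA module as a public function (I'll call it Levenshtein here, purely as a placeholder name), you can call it straight from an Access query, something like:

SELECT f.ID, a.[Person-ID],
    Levenshtein(a.[Last Name], f.[Last Name]) AS dist_last
FROM tbl_to_fill AS f, tbl_all AS a
WHERE Levenshtein(a.[Last Name], f.[Last Name]) <= 3;

Keep in mind this evaluates the function for every pairing of rows, which is exactly why the narrowing in the notes below matters.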

Some optimization notes:

  1. Based on the fact that you have 300,000 rows, I would go so far as to say you still need to narrow your results a bit; reading all 300,000 records for each match is unreasonable. For example, if you had state (which I see you don't), then a reasonable limit would be to say the state must match. That takes your 300,000 down to a much lower number. You may also want to assume that the first letter of the last name must match. That narrows it down further, and so on (see the sketch after this list).
  2. If you can, I would use an actual RDBMS instead of Access to store the data and let the DB server do the heavy lifting. PostgreSQL, in particular, has nice fuzzy-search capabilities available through extensions.
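
For example, with a narrowing like that in place, the WHERE clause of the sketch above might become:

WHERE a.[Postal Code] = f.[Postal Code]
    AND Left(a.[Last Name], 1) = Left(f.[Last Name], 1)
    AND Levenshtein(a.[Last Name], f.[Last Name]) <= 3

That way the expensive function call only runs on the pairs that survive the cheap conditions.
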
Hambone

One thing I have done in similar situations is to extract the first few characters of the last name, the first one or two characters of the first name, and the postal code, write them from both tables to temp tables, and do the matching query on the truncated tables. After some tinkering with how many characters to extract, I can usually find a balance between speed and false-positive matches that I can live with, and then have a human review the resulting list.

The speed difference can be significant: if instead of matching Schermerstien, Stephan to Schermerstien, Ste*an, you are matching Scher, St to Scher, St, you can see the processing advantage. But it only works if there is a small intersection between the tables and you can tolerate a human review step.
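
A rough sketch of that idea, reusing the field names from the question (ID stands in for the AutoNumber in tbl_to_fill, and the Left() lengths are exactly the part you tinker with), run as three separate queries:

SELECT ID, Left([Last Name], 5) AS LN5, Left([First Name], 2) AS FN2, [Postal Code]
INTO tmp_fill
FROM tbl_to_fill;

SELECT [Person-ID], Left([Last Name], 5) AS LN5, Left([First Name], 2) AS FN2, [Postal Code]
INTO tmp_all
FROM tbl_all;

SELECT f.ID, a.[Person-ID]
FROM tmp_fill AS f INNER JOIN tmp_all AS a
    ON (a.LN5 = f.LN5
    AND a.FN2 = f.FN2
    AND a.[Postal Code] = f.[Postal Code]);

Indexing LN5, FN2 and Postal Code on the two temp tables before running the join speeds it up further.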