0

I am having trouble joining to tables together, the 2 columns have similar data but not exactly the same data.

Example:

Table 1: Column 1 = "Expect rain for todays weather" Table 2: Column 2 = "Expect rain for todays weather and overcast clouds"

I have tried using the below to join but it is not 100% accurate: ... FROM [Table1] as [one] LEFT JOIN[Table2] as [two] ON [one].[column1] LIKE '%' + [two].[column2] + '%'

Would fuzzy matching be the best way to get the most accurate matches? Or are there other methods of joining Tables with columns that have similar but not exact data?

Any assistance and advise will be greatly appreciated!

Lesego Zim
  • 11
  • 2

2 Answers2

3

It is difficult in the situation you describe if one of the values in not exactly a subset of the other.

I use a Damerau-Levenshtein Distance function but it is really meant to measure typographical differences. I use it find typographical mistakes in street and suburb names so they are quite short strings. I don't know if the link is one I tried at one point but I currently use one written for CLR which is a lot faster. Probably too slow on long stings if you have a lot of records anyway.

Maybe consider using a Full Text Index and a CONTAINS query to find similar patterns of words.

Galaxiom
  • 109
  • 4
0

It just occurred to me, but I have not investigated, that a Damerau-Levenshtein Distance might be able to be calculated using whole words rather than just characters.

The CONTAINS or FREETEXT (or better still CONTAINSTABLE or FREETEXTTABLE) query would still be used to narrow down the candidates. One of the benefits of using these full text indexing solutions is the ability to automatically index grammatical variants of words such as plurals and different tenses as well as specify the proximity of multiple words in the phrase while ignoring insignificant words like 'the' and 'a' etc.

I originally implemented the code for the Weighted Damerau-Levenshtein function in Access VBA. It is here and includes a link to the original Excel version it was based. It isn't difficult to convert this code to VB.NET to use as a CLR function.

If it could be modified for words rather than characters, then it would score differences on word reversals, insertions and omissions, as it does on the basis of characters in the original function.

Galaxiom
  • 109
  • 4