For my research I have to match two data sets containing fund information. Unfortunately, there is no common identifier. Both data sets do contain an identifier for the document number, but a single document can contain multiple funds. If a document contains multiple funds (e.g. 20), I can only match them by fund name, which sometimes differs slightly between the two data sets. Note that the number of funds per document is identical in both data sets. After searching a bit, I tried this function (found here: agrep: only return best match(es)):

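library(RecordLinkage)  # provides levenshteinSim()

ClosestMatch2 = function(string, stringVector){

  # similarity (0 to 1) between string and each candidate; higher is closer
  similarity = levenshteinSim(string, stringVector)
  # returns every candidate tied for the highest similarity
  stringVector[similarity == max(similarity)]

}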

This worked fine for most funds; however, I discovered two problems:

  1. Sometimes there are multiple matches
  2. Sometimes I have wrong matches

For example: This function matched "INSTITUTIONAL LARGE CORE FUND" to "Transamerica Partners Institutional Core Bond" instead of "Transamerica Partners Institutional Large Core".

I have two ideas to circumvent these problems:

  1. I use another matching function to verify the function above, i.e. I only accept a match if both functions yield the same result.
  2. I somehow adapt the function above.
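One way the second idea could look is sketched below. It rests on assumptions that are not in the question: names are lower-cased and stripped of punctuation before comparison, base R's `adist` stands in for `levenshteinSim`, and funds within one document are paired greedily so that each name is used at most once. The helper names `normalize_name`, `name_similarity`, and `match_funds` are hypothetical. Whether this fixes the Transamerica example would need checking on the real data.

```r
# Lower-case, turn punctuation into spaces, collapse repeated whitespace.
normalize_name <- function(x) {
  x <- tolower(x)
  x <- gsub("[[:punct:]]+", " ", x)
  trimws(gsub("\\s+", " ", x))
}

# Similarity in [0, 1]: 1 - edit distance / length of the longer string.
name_similarity <- function(a, b) {
  a <- normalize_name(a)
  b <- normalize_name(b)
  d <- adist(a, b)                           # length(a) x length(b) matrix
  longest <- outer(nchar(a), nchar(b), pmax)
  1 - d / pmax(longest, 1)                   # guard against empty strings
}

# Greedy one-to-one pairing of the fund names of a single document:
# repeatedly take the most similar remaining pair and retire both names.
match_funds <- function(names_a, names_b) {
  sim <- name_similarity(names_a, names_b)
  result <- data.frame(a = names_a, b = NA_character_, similarity = NA_real_,
                       stringsAsFactors = FALSE)
  for (k in seq_len(min(length(names_a), length(names_b)))) {
    best <- which(sim == max(sim, na.rm = TRUE), arr.ind = TRUE)[1, ]
    result$b[best["row"]] <- names_b[best["col"]]
    result$similarity[best["row"]] <- sim[best["row"], best["col"]]
    sim[best["row"], ] <- NA   # this name from the first data set is matched
    sim[, best["col"]] <- NA   # this name from the second data set is taken
  }
  result
}

res <- match_funds(
  c("Mid-Cap Fund", "MODERATE STRATEGY ALLOCATION FUND"),
  c("Moderate Strategy Alloc. Fund", "Mid Cap Fund")
)
print(res)
```

Because each name is retired once it is paired, a fund can no longer come back with several matches. The price is that one bad early pairing can push later ones off, so low values in the `similarity` column are worth reviewing by hand.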

I would really appreciate your help. Best, Laurenz

  • It seems like you're looking for the presence of entire substrings (like "Large", "Partners", etc.) and not for "mismatches" within them. Is that right? – Arun Apr 22 '13 at 11:16
  • Most of the time that is correct. However, there are rare occasions where the substrings are similar but not identical. For example: Mid-Cap Fund & Mid Cap Fund, or MODERATE STRATEGY ALLOCATION FUND & Moderate Strategy Alloc. Fund – Laurenz Apr 22 '13 at 13:12

1 Answer

The RecordLinkage package lets you match strings using several measures (e.g. Levenshtein, but also others). It also lets you define thresholds, or even use a classification model, to indicate when a match is acceptable to you.
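As a rough illustration of the threshold idea, the sketch below accepts a candidate only if two RecordLinkage measures, `jarowinkler` and `levenshteinSim`, pick the same candidate and the Jaro-Winkler similarity reaches a cut-off. The function name `CrossCheckedMatch`, the lower-casing step, and the default threshold of 0.8 are assumptions made for illustration; the threshold in particular would have to be tuned on real fund names.

```r
library(RecordLinkage)  # provides jarowinkler() and levenshteinSim()

# Hypothetical helper: returns the agreed best match, or NA if the two
# measures disagree or the best Jaro-Winkler similarity is below threshold.
CrossCheckedMatch <- function(string, stringVector, threshold = 0.8) {
  v <- tolower(stringVector)
  s <- rep(tolower(string), length(v))  # compare the query with every candidate

  jw  <- jarowinkler(s, v)      # Jaro-Winkler similarity, 0 to 1
  lev <- levenshteinSim(s, v)   # Levenshtein similarity, 0 to 1

  best_jw  <- which.max(jw)
  best_lev <- which.max(lev)

  if (best_jw == best_lev && jw[best_jw] >= threshold) {
    stringVector[best_jw]
  } else {
    NA_character_   # flag for manual review instead of guessing
  }
}

candidates <- c("Mid Cap Fund", "Large Cap Value Fund")
accepted <- CrossCheckedMatch("Mid-Cap Fund", candidates)
rejected <- CrossCheckedMatch("Mid-Cap Fund", candidates, threshold = 1)
```

Returning `NA` rather than the nearest candidate keeps doubtful pairs out of the merged data, so they can be checked by hand.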

  • It'd be nice to show the possibility using this package or write this under comments. It's not an answer yet, I believe. – Arun Apr 22 '13 at 11:29
  • Thanks for your answer! I checked the package and only found Jaro-Winkler (jarowinkler) and some modification of the Levenshtein distance (levenshteinDist), which I will try as alternatives. Since you said that there are "several" approaches, I wondered whether I have overlooked any others. Thanks for clarifying! – Laurenz Apr 22 '13 at 13:08
  • E.g. the Hamming distance for character strings can also be implemented quickly. You can also supply a strcmpfun in the comparison; it must take the two strings to be compared as arguments and return a similarity value in the range 0-1. –  Apr 22 '13 at 13:20
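A function of the shape described in the last comment (two strings in, a similarity between 0 and 1 out) could be sketched as a word-overlap measure, which also fits the earlier observation that whole words such as "Large" matter more than character-level mismatches. The name `token_similarity` and the choice of Jaccard overlap are assumptions for illustration. The sketch does not show how such a function is passed to RecordLinkage, and a vectorized wrapper may be needed there.

```r
# Hypothetical comparison function: Jaccard overlap of the word sets,
# ignoring case and punctuation. Returns a value between 0 and 1.
token_similarity <- function(str1, str2) {
  tokenize <- function(x) {
    x <- gsub("[[:punct:]]+", " ", tolower(x))
    unique(strsplit(trimws(x), "\\s+")[[1]])
  }
  a <- tokenize(str1)
  b <- tokenize(str2)
  all_words <- union(a, b)
  if (length(all_words) == 0) return(0)   # both strings empty
  length(intersect(a, b)) / length(all_words)
}

query <- "INSTITUTIONAL LARGE CORE FUND"
sim_bond  <- token_similarity(query, "Transamerica Partners Institutional Core Bond")
sim_large <- token_similarity(query, "Transamerica Partners Institutional Large Core")
# sim_bond  is 2/7 (shared words: institutional, core)
# sim_large is 3/6 (shared words: institutional, large, core)
```

On the example from the question this measure prefers the "Large Core" candidate. Abbreviations such as "Alloc." versus "ALLOCATION" still count as different words, so combining it with a character-level measure may be worthwhile.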