For my research I have to match two data sets containing fund information. Unfortunately, there is no common fund identifier. The good news is that both data sets contain a document number; however, a single document can contain multiple funds. If a document contains multiple funds (e.g. 20), I can only match them via the fund name, which sometimes differs slightly between the two data sets. Note that the number of funds per document is identical in both data sets. After searching a bit, I tried the following function, which relies on levenshteinSim() from the RecordLinkage package (found here: agrep: only return best match(es)):
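library(RecordLinkage)  # provides levenshteinSim()

# Return the element(s) of stringVector most similar to string
ClosestMatch2 <- function(string, stringVector) {
  distance <- levenshteinSim(string, stringVector)  # similarity scores, higher = closer
  stringVector[distance == max(distance)]           # every candidate tied for the top score
}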
This worked fine for most funds, however I discovered two problems:
- Sometimes there are multiple matches (several candidates tie for the highest score)
- Sometimes the match is wrong
For example, the function matched "INSTITUTIONAL LARGE CORE FUND" to "Transamerica Partners Institutional Core Bond" instead of "Transamerica Partners Institutional Large Core".
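To make the example easier to reproduce, here is a small sketch that prints the scores for the two candidates. It is not the original function: it assumes base R's adist() as a stand-in for levenshteinSim(), and lev_scores and its lower argument are names I made up for illustration. I suspect, but have not confirmed, that the all-caps name in one data set plays a role, so the sketch shows the scores with and without lower-casing.

```r
# The query and the two candidate names from the example above.
query <- "INSTITUTIONAL LARGE CORE FUND"
candidates <- c("Transamerica Partners Institutional Core Bond",
                "Transamerica Partners Institutional Large Core")

# Hypothetical helper: similarity = 1 - edit distance / length of the longer string.
# adist() is base R (utils); lower = TRUE lower-cases both sides first.
lev_scores <- function(string, stringVector, lower = FALSE) {
  if (lower) {
    string <- tolower(string)
    stringVector <- tolower(stringVector)
  }
  d <- as.vector(utils::adist(string, stringVector))
  1 - d / pmax(nchar(string), nchar(stringVector))
}

# One row per candidate, one column per variant of the score.
scores <- data.frame(
  candidate      = candidates,
  case_sensitive = lev_scores(query, candidates),
  lower_cased    = lev_scores(query, candidates, lower = TRUE)
)
print(scores)
```

Comparing the two score columns should show whether normalising the case alone changes which candidate ranks first.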
I have two ideas to circumvent these problems:
- Use a second matching function to verify the one above, i.e. only accept a match if both functions yield the same result.
- Adapt the function above in some way.
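To make the first idea concrete, this is roughly what I have in mind. It is only a sketch under my own assumptions: base R's adist() stands in for levenshteinSim(), names are lower-cased first, and the second matcher is a simple word-overlap (Jaccard) score. All helper names (lev_sim, jaccard_sim, best_index, agreed_match) are hypothetical.

```r
# Matcher 1: normalised Levenshtein similarity on lower-cased names (base R adist()).
lev_sim <- function(string, stringVector) {
  a <- tolower(string)
  b <- tolower(stringVector)
  d <- as.vector(utils::adist(a, b))
  1 - d / pmax(nchar(a), nchar(b))
}

# Matcher 2: Jaccard similarity on the sets of lower-cased words.
jaccard_sim <- function(string, stringVector) {
  tokens <- function(x) {
    t <- strsplit(tolower(x), "[^a-z0-9]+")[[1]]
    unique(t[nzchar(t)])
  }
  a <- tokens(string)
  vapply(stringVector, function(s) {
    b <- tokens(s)
    length(intersect(a, b)) / length(union(a, b))
  }, numeric(1), USE.NAMES = FALSE)
}

# Index of the single best candidate, or NA if the top score is tied.
best_index <- function(sim) {
  top <- which(sim == max(sim))
  if (length(top) == 1) top else NA_integer_
}

# Accept a match only if both matchers pick the same, unambiguous candidate.
agreed_match <- function(string, stringVector) {
  i <- best_index(lev_sim(string, stringVector))
  j <- best_index(jaccard_sim(string, stringVector))
  if (!is.na(i) && !is.na(j) && i == j) stringVector[i] else NA_character_
}

candidates <- c("Transamerica Partners Institutional Core Bond",
                "Transamerica Partners Institutional Large Core")
agreed_match("INSTITUTIONAL LARGE CORE FUND", candidates)
```

Names for which agreed_match() returns NA would then be left for manual checking. Since the number of funds per document is identical in both data sets, stringVector could be restricted to the funds of the same document number.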
I would really appreciate your help. Best, Laurenz