0

Here's the problem I'm trying to solve:

I have two different character columns. Both columns do come from different datasets and describe exactly the same objects, but with some slight differences and ordered in a different way. Here's a very little sample of each one.

A                       B
IFK Kristianstad        Chekhov
HC Vardar               Barça
Rhein-Neckar Löwen      Vardar
FC Barcelona            Rhein-Neckar Löwen
Chekhovski Medvedi      PSG
Paris Saint-Germain     Kristianstad

My goal is to match each value from column A with its correspondant in column B and get them together into a single dataframe, so my desired output will look like this:

A                       B
IFK Kristianstad        Kristianstad
HC Vardar               Vardar
Rhein-Neckar Löwen      Rhein-Neckar Löwen
FC Barcelona            Barça
Chekhovski Medvedi      Chekhov
Paris Saint-Germain     PSG

Moreover, as you can see even if in almost all cases, smaller or bigger, there is at least a match between each value on A with its correspondant in B, in one case (Paris Saint-Germain / PSG) they are linked because one is the acronym of the other. I have no idea about if there is a simple way to deal with it without having to rename anyone of the two values involved.

When it comes to the rest, until now I've tried unsuccessfully with functions such as match() or str_match().

teogj
  • 289
  • 1
  • 11
  • What do you want to do if there are multiple matches--would you want all the matches in a list? Or, will there not be multiple matches? – Andrew Jul 23 '19 at 13:58
  • I think you need `stringdist` i.e. someething like `apply(outer(df1$A, df1$B, FUN = stringdist), 1, function(x) df1$B[which.min(x)])` – akrun Jul 23 '19 at 14:00
  • `match` and `str_match` are for exact matches. You want fuzzy matching, using e.g., `agrep` or `pmatch`. You may also want to do some special checks for your case, like extract all the capital letters and see if they match (that would work for your PSG, but will also have many false positives). Here are a couple answers to get you started: [Merging two Data Frames using Fuzzy/Approximate String Matching in R](https://stackoverflow.com/a/38246955/903061), [Merging through fuzzy matching of variables in R](https://stackoverflow.com/a/7120442/903061) – Gregor Thomas Jul 23 '19 at 14:02
  • I should review my entire datasets but I'm almost sure that there aren't multiple matches. Anyway even if there were my goal is just to get together the values that describe the same object in the same row. – teogj Jul 23 '19 at 14:02
  • 1
    This will require multiple steps. Step 1 is to get rid of acronyms. Any fuzzy matching algorithm will have problems in matching PSG to Paris Saint-Germain. You might get away with Barça, but even that one I would include in step 1. Step 2 fuzzy match the rest and see what falls between the cracks. Use the links @Gregor supplied or use `stringdist` to match between the columns. Rinse and repeat as needed. – phiver Jul 23 '19 at 14:30

0 Answers0