-1

I am struggling to understand how to combine two tables in R when the values of the common variables do not match exactly.

To give some context, I have downloaded two sources of information about politicians, one from Twitter and one from the administration, and created two different data frames. The first data frame (dataset 1) contains the names of the politicians present on Twitter. However, I don't know whether these politicians are currently in office. To find out, I could use the second data frame (dataset 2), which contains the names and other information about the politicians currently in office. The first and last names are the only variables contained in both tables. The two tables do not have the same number of rows.

Problem:

  1. The names in the first dataset are stored in one variable (first name + last name), whereas in the second dataset they are split into two variables (last name and first name). I used `separate` to split the name column in the first table: `parliament_twitter_tempdata <- separate(parliament_twitter_tempdata, col = name, into = c("firstname", "lastname"), extra = "merge")`. However, this causes problems because both datasets have:
    • compound first names and compound last names
    • first name and last name in the wrong order

I have included pictures of part of both datasets (last names "J" to "M") to illustrate the differences between near-identical values and the inversion of last name and first name.

How could I improve my code?

  2. The names in the two tables do not match completely: some people did not use their official name on Twitter. Is there a function that could compare the two tables, find the names that match at around 80%, and replace the name in data frame 1 (from Twitter) with the official name from data frame 2? Example: dataset 1 has Marie Gabour; dataset 2 has Marie Gabour Jolliet -> replace Marie Gabour in dataset 1 with Marie Gabour Jolliet.

Could someone help me there? Many thanks !

[Image 1: part of dataset 1 after using separate (last names "J" to "M")]
[Image 2: part of the names in dataset 2 (last names "J" to "M")]

  • 4
    Please edit your question as shown [here](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) to make it reproducible and easier to understand. – NelsonGon May 01 '19 at 12:40
  • See https://stackoverflow.com/questions/22894265/how-to-perform-approximate-fuzzy-name-matching-in-r for some ideas. – cgrafe May 01 '19 at 14:45

1 Answer

1

Fuzzy matching might be a way to move forward:

https://cran.r-project.org/web/packages/fuzzyjoin/fuzzyjoin.pdf
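To show the idea behind fuzzy matching without extra packages, here is a minimal sketch in base R using `adist` (edit distance) rather than fuzzyjoin itself; fuzzyjoin's `stringdist_left_join` wraps a similar comparison into a join. The toy data, the column name `name`, the similarity formula, and the `threshold` value are all assumptions for illustration, not taken from your data.

```r
# Toy data: names and the column name `name` are made up for illustration.
twitter  <- data.frame(name = c("Marie Gabour", "Jean Martin", "Luc Keller"),
                       stringsAsFactors = FALSE)
official <- data.frame(name = c("Marie Gabour Jolliet", "Jean Martin", "Anna Lenz"),
                       stringsAsFactors = FALSE)

# Edit distance between every Twitter name (rows) and every official name (columns).
d <- adist(twitter$name, official$name, ignore.case = TRUE)

# Scale by the longer string of each pair to get a similarity between 0 and 1.
longest <- outer(nchar(twitter$name), nchar(official$name), pmax)
sim <- 1 - d / longest

# For each Twitter name, find the closest official name and its similarity.
best <- apply(sim, 1, which.max)
best_sim <- sim[cbind(seq_len(nrow(sim)), best)]

# Keep the official name only when the similarity reaches the threshold
# (assumed value; tune it on your own data).
threshold <- 0.55
twitter$official_name <- ifelse(best_sim >= threshold,
                                official$name[best],
                                NA_character_)
```

Note that with this measure "Marie Gabour" against "Marie Gabour Jolliet" scores 0.6 (8 insertions over 20 characters), so a strict 80% cut-off would miss that pair. Whatever threshold you pick, it is worth checking the matched pairs by eye before overwriting any names.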

Also, cleaning functions may help, e.g. applying `toupper` to the key or removing whitespace from it with `trimws`.
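A cleaned key can also sidestep the first/last name order problem, since you can sort the words of each name before comparing. The following is only a sketch in base R: `make_key` is a hypothetical helper, and the toy data and the `party` column are invented for illustration.

```r
# Hypothetical helper: build a normalised, order-insensitive key from a full name.
make_key <- function(x) {
  x <- toupper(trimws(x))        # upper case, drop leading/trailing whitespace
  x <- gsub("[-']", " ", x)      # treat hyphens and apostrophes as spaces
  x <- gsub("\\s+", " ", x)      # collapse repeated whitespace
  # Sort the words so that "Martin Jean" and "Jean Martin" give the same key.
  vapply(strsplit(x, " ", fixed = TRUE),
         function(parts) paste(sort(parts), collapse = " "),
         character(1))
}

# Toy data standing in for the two data frames.
twitter  <- data.frame(name = c("Martin Jean-Luc", " marie  gabour"),
                       stringsAsFactors = FALSE)
official <- data.frame(firstname = c("Jean Luc", "Marie"),
                       lastname  = c("Martin", "Gabour"),
                       party     = c("A", "B"),
                       stringsAsFactors = FALSE)

# Build the same key on both sides, then do an ordinary exact join on it.
twitter$key  <- make_key(twitter$name)
official$key <- make_key(paste(official$firstname, official$lastname))
merged <- merge(twitter, official, by = "key", all.x = TRUE)
```

Because the key is built from the full name, there is no need to split dataset 1 into first and last name at all. Rows that still fail to match after cleaning are the ones left over for fuzzy matching.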

dca