R merge by string adds duplicate

Question

I'm attempting to use the merge command for the first time have been stuck with the following problem that I can't seem to get around. I have two dataframe that I am trying to merge using one column (a string of people's names). When I merge the result ends up with some names merging correctly but with the majority not merging only adding additional blank lines. My goal is after I merge the dataframes to have the same number of observations as originally in dataframe1 (118,663) but instead of having 10 variables to have 12 (with all information filled in and no NA values present).

While I understand that to some this question may appear similar to others which discuss merge (inner, outer, left, or right), my point in asking it is two fold. 1. to get help with a problem that I have been trying to fix for several days. 2. to get help because other answers which discuss merge and use the four types mentioned a moment ago do not clearly explain what these types are and how they work.

At first I thought it was something stupid like the names had been misspelled between the two dataframes or I had an extra space before or after the names. I've checked the print results in R and opened both cvs files and can verify that the names are completely identical.

Here is some basic information about the dataframes.

df1(combine)
str(combine)

'data.frame':   118663 obs. of  10 variables:
$ uri: Factor w/ 118163 levels "http://data.parliament.uk/resources/532864",..: 392 393 394 395 396 397 398 399 400 401 ...
$ answer.date           : Factor w/ 470 levels "2016-07-07","2016-07-11",..: 5 5 5 5 5 4 4 4 4 4 ...
$ answering.body        : Factor w/ 33 levels "Cabinet Office",..: 8 8 8 8 8 8 8 8 8 8 ...
$ date.tabled           : Factor w/ 543 levels "2016-07-05","2016-07-06",..: 5 5 5 5 5 5 5 5 5 5 ...
$ question.text         : Factor w/ 117729 levels "To ask Mr Chancellor of the Exchequer, how many complaints relating to class 2 national insurance contributions have been recei"| __truncated__,..: 199 234 236 214 212 198 226 193 190 207 ...
$ tabling.member...label: Factor w/ 753 levels "Biography information for Adam Holloway",..: 105 105 105 62 123 9 112 112 112 112 ...
$ tabling.member.printed: Factor w/ 795 levels "Adam Holloway",..: 105 105 105 62 123 9 112 112 112 112 ...
$ title                 : Factor w/ 118163 levels "House of Commons Tabled Parliamentary Question 2016/17 41835",..: 396 394 395 474 459 432 433 434 435 436 ...
$ uin                   : int  42286 42282 42283 42418 42391 42347 42351 42352 42353 42354 ...
$ mpnames               : chr  "Rachael Maskell" "Rachael Maskell" "Rachael Maskell" "Luciana Berger" ...

and the second dataframe that I am attempting to merge it with

df(constituencies)
str(constituencies)

'data.frame':   811 obs. of  3 variables:
$ party       : Factor w/ 17 levels "Alliance","Conservative",..: 2 2 8 9 2 9 12 6 2 2 ...
$ constituency: Factor w/ 650 levels "Aberavon","Aberconwy",..: 628 251 614 578 110 40 309 586 482 483 ...
$ mpnames     : Factor w/ 811 levels "Adam Afriyie",..: 1 2 4 3 5 6 8 9 10 11 ...

As is hopefully clear I am attempting to merge them using the mpnames variable. I started out with the following code. Which as I mentioned seemed to only work for a few of the names.

combine_constituencies <- merge(combine, constituencies, by = "mpnames", all = TRUE, incomparables = NA)

I added in the incomparables = NA based on suggestions I saw on here to people having merging problems even though in this instance it doesn't appear to add anything that wasn't already there. The result of this merge gives me the correct number of variables, 12 in total, but as you can see the number of observations has increased by 621.

str(combine_constituencies)
'data.frame':   119284 obs. of  12 variables:

When I view the results of combine_constituencies I get something that resembles this example.

mpnames         answer.date   date.tabled  ...  party          constituency
Zac Goldsmith   2016-04-11    2016-03-23        NA             NA
Zac Goldsmith   2016-06-27    2016-06-14        NA             NA
Zac Goldsmith   NA            NA                Conservative   Richmond Park

The result I am trying to get is something closer to this.

mpnames         answer.date   date.tabled  ...  party          constituency
Zac Goldsmith   2016-04-11    2016-03-23        Conservative   Richmond Park
Zac Goldsmith   2016-06-27    2016-06-14        Conservative   Richmond Park

If the problem isn't with the variable I am using to merge, which from the several checks I mentioned above, I don't believe it is, what else might be going on here?

In an effort to be thorough I also tried using join which gave me the total number of observations I am aiming for but filled in NA for both the party and constituency variable similar to the example just above.

Any suggestions would be much appreciated.

Your output is precisely what I would expect from doing an outer join between your two data tables. If you tell us what output you want maybe you will get an answer. — Tim Biegeleisen, Aug 18 '16 at 04:00
I've added a bit more information and an additional example of the output I am aiming for. Thank you for letting me know I needed to be a bit clearer. — C. Martin, Aug 18 '16 at 04:07
@akrun This isn't a direct duplicate of your label, and I'm surprised you didn't answer this question. There is no way to modify his call to `merge` to generate his desired output. — Tim Biegeleisen, Aug 18 '16 at 04:11
I added a bit more clarification in the body of the question to help identify why this question is different to the one you mentioned. I had seen the one @akrun mentioned but did not find an answer to my problem there. Really appreciate your comments and would be very grateful if either of you had any ideas. — C. Martin, Aug 18 '16 at 04:18
You have a `factor` column in one dataset while in other it is `character`. Also, check if there are leading/lagging spaces in the "mpnames" in each of the dataset. The reason why number of observations increase is the duplicates for 'mpnames' in both the dataset. It needs to have a unique 'mpnames' in at least one of the dataset. One option will be to add a sequence column by mpnames in each of the dataset and merge by both sequence and mpnames or remove the duplicate rows in one dataset and do the merge. — akrun, Aug 18 '16 at 04:21
@akrun I've checked for leading/lagging spaces, as I mentioned, both in R and by hand and absolutely none were present. I had also noticed the `factor` vs. `character` problem but a) wasn't sure if it was the source of my problem, and b) how to fix it. Does it seem like this is likely the culprit? — C. Martin, Aug 18 '16 at 04:25
As I read your post again, I think it must be due to the duplicates for mpnames in both dataset. I edited my earlier comment. Please check. — akrun, Aug 18 '16 at 04:26
I think [this](http://stackoverflow.com/questions/30410681/merge-in-r-results-in-more-rows-than-one-of-the-data-frames) is a similar problem. Please go through the advise in the comments. — akrun, Aug 18 '16 at 04:28
@akrun I looked over the link you provided and while it initially sounds like a similar problem I don't think the root cause is the same. I also looked at your revised comment, and after reviewing both of these think I failed to provide enough background on what the dfs are. Each row in the first df is a unique written comment made by one of several hundred people. This is why I have duplicate `mpnames` and why I can't delete them. My goal is in this larger df each time a person's name appears to have their corresponding party and constituency added. Is `merge` maybe not the best command here? — C. Martin, Aug 18 '16 at 04:54

score 0 · Answer 1 · answered Aug 18 '16 at 05:07

0

Swap the order of the data frames and left join.

Here's a stripped down version of your datasets:

constituencies <- data.frame(
  mpnames = c("Zac Goldsmith", "Adam Afriyie"),
  constituency = c("Aberavon", "Richmond Park"),
  party = c("Alliance", "Conservative"),
  stringsAsFactors = FALSE
)

combine <- data.frame(
  mpnames = c("Zac Goldsmith", "Zac Goldsmith"),
  answer.date = as.Date(c("2016-04-11", "2016-06-27")),
  date.tabled = as.Date(c("2016-03-23", "2016-06-14")),
  stringsAsFactors = FALSE
)

Here's the join that you want:

library(dplyr)
left_join(constituencies, combine, by = "mpnames")
##         mpnames  constituency        party answer.date date.tabled
## 1 Zac Goldsmith      Aberavon     Alliance  2016-04-11  2016-03-23
## 2 Zac Goldsmith      Aberavon     Alliance  2016-06-27  2016-06-14
## 3  Adam Afriyie Richmond Park Conservative        <NA>        <NA>

The important insight is that mpnames are unique in the constituencies dataset but not the combine dataset.

answered Aug 18 '16 at 05:07

Richie Cotton

118,240
47
247
360

Thank you most particularly for your last sentence. It now makes sense why I would need to flip the order of the dfs. I added in the `stringAsFactors = FALSE` to both lines where I read in the csv file for each df, then I ran both lines of code you provided. The resulting new df has 22,457 obs with 13 variables, with NA in every column except `party` `constituency` and `mpnames`. The times where the rest of the columns have merged correctly are the same instance where I previously got them to merge correctly. Any thoughts? Did I miss something by adding the `stringAsFactors` where I did? – C. Martin Aug 18 '16 at 05:58
Are these datasets publically available? If you can link to them, it would make the question much easier to answer. – Richie Cotton Aug 18 '16 at 06:38
sadly neither dataset is publicly available, I created both by web-scraping in R – C. Martin Aug 18 '16 at 17:39

R merge by string adds duplicate

1 Answers1