I'm attempting to use the merge
command for the first time have been stuck with the following problem that I can't seem to get around. I have two dataframe that I am trying to merge using one column (a string of people's names). When I merge the result ends up with some names merging correctly but with the majority not merging only adding additional blank lines. My goal is after I merge the dataframes to have the same number of observations as originally in dataframe1 (118,663) but instead of having 10 variables to have 12 (with all information filled in and no NA values present).
While I understand that to some this question may appear similar to others which discuss merge (inner, outer, left, or right), my point in asking it is two fold. 1. to get help with a problem that I have been trying to fix for several days. 2. to get help because other answers which discuss merge and use the four types mentioned a moment ago do not clearly explain what these types are and how they work.
At first I thought it was something stupid like the names had been misspelled between the two dataframes or I had an extra space before or after the names. I've checked the print results in R and opened both cvs files and can verify that the names are completely identical.
Here is some basic information about the dataframes.
df1(combine)
str(combine)
'data.frame': 118663 obs. of 10 variables:
$ uri: Factor w/ 118163 levels "http://data.parliament.uk/resources/532864",..: 392 393 394 395 396 397 398 399 400 401 ...
$ answer.date : Factor w/ 470 levels "2016-07-07","2016-07-11",..: 5 5 5 5 5 4 4 4 4 4 ...
$ answering.body : Factor w/ 33 levels "Cabinet Office",..: 8 8 8 8 8 8 8 8 8 8 ...
$ date.tabled : Factor w/ 543 levels "2016-07-05","2016-07-06",..: 5 5 5 5 5 5 5 5 5 5 ...
$ question.text : Factor w/ 117729 levels "To ask Mr Chancellor of the Exchequer, how many complaints relating to class 2 national insurance contributions have been recei"| __truncated__,..: 199 234 236 214 212 198 226 193 190 207 ...
$ tabling.member...label: Factor w/ 753 levels "Biography information for Adam Holloway",..: 105 105 105 62 123 9 112 112 112 112 ...
$ tabling.member.printed: Factor w/ 795 levels "Adam Holloway",..: 105 105 105 62 123 9 112 112 112 112 ...
$ title : Factor w/ 118163 levels "House of Commons Tabled Parliamentary Question 2016/17 41835",..: 396 394 395 474 459 432 433 434 435 436 ...
$ uin : int 42286 42282 42283 42418 42391 42347 42351 42352 42353 42354 ...
$ mpnames : chr "Rachael Maskell" "Rachael Maskell" "Rachael Maskell" "Luciana Berger" ...
and the second dataframe that I am attempting to merge it with
df(constituencies)
str(constituencies)
'data.frame': 811 obs. of 3 variables:
$ party : Factor w/ 17 levels "Alliance","Conservative",..: 2 2 8 9 2 9 12 6 2 2 ...
$ constituency: Factor w/ 650 levels "Aberavon","Aberconwy",..: 628 251 614 578 110 40 309 586 482 483 ...
$ mpnames : Factor w/ 811 levels "Adam Afriyie",..: 1 2 4 3 5 6 8 9 10 11 ...
As is hopefully clear I am attempting to merge them using the mpnames variable. I started out with the following code. Which as I mentioned seemed to only work for a few of the names.
combine_constituencies <- merge(combine, constituencies, by = "mpnames", all = TRUE, incomparables = NA)
I added in the incomparables = NA
based on suggestions I saw on here to people having merging problems even though in this instance it doesn't appear to add anything that wasn't already there. The result of this merge gives me the correct number of variables, 12 in total, but as you can see the number of observations has increased by 621.
str(combine_constituencies)
'data.frame': 119284 obs. of 12 variables:
When I view the results of combine_constituencies
I get something that resembles this example.
mpnames answer.date date.tabled ... party constituency
Zac Goldsmith 2016-04-11 2016-03-23 NA NA
Zac Goldsmith 2016-06-27 2016-06-14 NA NA
Zac Goldsmith NA NA Conservative Richmond Park
The result I am trying to get is something closer to this.
mpnames answer.date date.tabled ... party constituency
Zac Goldsmith 2016-04-11 2016-03-23 Conservative Richmond Park
Zac Goldsmith 2016-06-27 2016-06-14 Conservative Richmond Park
If the problem isn't with the variable I am using to merge, which from the several checks I mentioned above, I don't believe it is, what else might be going on here?
In an effort to be thorough I also tried using join
which gave me the total number of observations I am aiming for but filled in NA for both the party and constituency variable similar to the example just above.
Any suggestions would be much appreciated.