1

I have two datasets each that are monthly summaries of different variables for households. I want to merge the two based on the household ID and the month.

df1 one looks like this:

     hh_ids      date total
     <chr>     <chr> <dbl>
1  KELDK13  2013-8-1     1
2  KMOMB02  2013-2-1     1
3  KMOMB02  2013-5-1     2
4  KMOMB04  2013-7-1     2
5  KMOMB04  2013-9-1     1
6  KMOMB06  2013-6-1     1
7  KMOMB14  2013-8-1     1
8  KMOMB16  2013-6-1     1
9  KMOMB17 2012-10-1     1
10 KMOMB17 2012-11-1     2

and the first 10 rows of df2 looks like:

  hh_ids      date    income consumption alcohol cleaning_materials  clothing
1  KELDK01 2012-11-1  62.70588    40.52941       0           0.000000  0.000000
2  KELDK01 2012-12-1  17.64706    42.43530       0           1.058824  7.058824
3  KELDK01 2013-01-1  91.76471    48.23529       0           0.000000  0.000000
4  KELDK01 2013-02-1  91.76470   107.52940       0           0.000000  0.000000
5  KELDK01 2013-03-1 116.47060   114.47060       0           0.000000  0.000000
6  KELDK01 2013-04-1 124.41180   118.29410       0           2.705882 17.647060
7  KELDK01 2013-05-1 137.23530   105.00000       0           1.411765  1.882353
8  KELDK01 2013-06-1 131.52940   109.54120       0           4.352942  2.941176
9  KELDK01 2013-07-1 121.52940   113.47060       0           2.352941 25.882350
10 KELDK01 2013-08-1 123.32940    86.50588       0           2.588235  2.941176

I want the "total" column from to be added on as a column in df2 with the matched hh_ids and date.

I have tried to do the following:

df3<-merge(df2,df1,by=c("hh_ids","date")) 

However, my df2 has 53 rows, and df1 has many more, and the resulting df3 only has 14 rows. Any suggestions would be much appreciated!

  • How many rows you will expect? If your overlap is only 14, then the result is fine. – Roman Aug 04 '16 at 13:10
  • It is possible there are only 14 HH-months that are in both data.frames. Another thing to check is that the variable types are the same for the two data sets so there isn't anything weird going on with factors / variable conversion during the merge. – lmo Aug 04 '16 at 13:11
  • @Jimbou: I expect that all 53 combinations of hh/date should appear in the larger data.frame, so I expect 53 rows. Apologies for the lack of clarity – Ashley Thomas Aug 04 '16 at 13:17

1 Answers1

3

If you want to retain all rows from your df2 even if they do not match to anything in df1 then you can use the all parameter in your merge:

df3 <- merge(df2, df1, by=c("hh_ids","date"), all.x=TRUE)

This is equivalent to doing a LEFT JOIN in SQL between df2 on the left and df1 on the right.

Tim Biegeleisen
  • 502,043
  • 27
  • 286
  • 360
  • ouff...18 seconds :) – Sotos Aug 04 '16 at 13:09
  • This is nothing...check out Arun's [profile](http://stackoverflow.com/users/3732271/akrun) if you want to be amazed by an R person :-) – Tim Biegeleisen Aug 04 '16 at 13:10
  • yeah but he doesn't count... he plays on a different league :) – Sotos Aug 04 '16 at 13:12
  • Gordon is king of dupes if you ask me...really each R question is a separate problem in its own right, but so be it. – Tim Biegeleisen Aug 04 '16 at 13:12
  • Thanks for the help. Unfortunately, I'm still having issues.. for some reason when I do the formula above, the "total" column in df2 returns all "NA". Any thoughts as to why this is happening? – Ashley Thomas Aug 04 '16 at 15:20
  • I don't have your actual dataset in front of me, but if a row in `df2` does not match to anything in `df1` then you would get `NA` for the total. Are you sure there is matching data? – Tim Biegeleisen Aug 04 '16 at 15:22