Basic question on merging two data frames based on different variables

Question

I have two different data.frames with diff. sizes.

dim(df1) = 2942 obs.    6 var.
dim(df2)= 16533 obs. 2307 var.

I would like to merge df1 with df2 aiming for a df3 with 2942 observations.

The following variables define observations in data frames: serial (group indetification number), id1 (person identifier from the group ranges from 1 to number of people in the group), Day (the week day when the record was made). The day variable is defined as: Mon.:1; Tue.:2; Wed.:3, Thur.:4, Fri.:5, Sat.:6, Sun.:7

In df2 there are 2 observations for the same serial . I would like to have a df with the serials and id1s' on Day level. So basically I create a new variable index for df1 and df2

library(dplyr)

df1<-df1 %>% 
      mutate(index = group_indices_(df1, .dots=c("serial", "id1", "id2"))) 


df2<-df2 %>% 
      mutate(index = group_indices_(df2, .dots=c("serial", "id1")))

Please see the sample data.

I was using the above code to merge:

library(dplyr)
df3<-inner_join(df1,df2,by=c("index","Day"),suffix=c(".df1",".df2"))

..and I receive a df3 with 65 obs. and 2310 var. and not 2942 obs and 2310 var.

Can somebody explain why I have this issue?

Sample date:

df1

structure(list(serial = c(12, 123, 123, 10, 10), id1 = c(1, 1, 
2, 1, 2), Day = c(1, 3, 2, 4, 2)), class = "data.frame", row.names = c(NA, 
-5L))

df2

structure(list(serial = c(12, 12, 123, 123, 123, 123, 10, 10, 
10, 10, 10, 10), id1 = c(1, 1, 1, 1, 2, 2, 1, 1, 2, 2, 3, 3), 
    id2 = c(1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2), Day = c(1, 6, 
    3, 7, 2, 7, 4, 7, 2, 7, 4, 7), index = c(7L, 8L, 9L, 10L, 
    11L, 12L, 1L, 2L, 3L, 4L, 5L, 6L)), row.names = c(NA, -12L
), class = "data.frame")

Sample data outcome:

serial id1 id2 Day
12      1   1   1
123     1   1   3
123     2   1   2
10      1   1   4
10      2   1   2

Probably not. Without seeing sample data and output from that sample data, it can only be speculation. — r2evans, Apr 14 '20 at 15:33
Just to understand: do you want to keep the elements of the first table for which there is no information in the second table? In that case, you might try anti_join — johnjohn, Apr 14 '20 at 15:34
Suggested duplicate: [What's the difference between INNER, LEFT, RIGHT, and OUTER joins?](https://stackoverflow.com/q/5706437/903061) — Gregor Thomas, Apr 14 '20 at 16:07
@Gregor i am not sure if this is a duplicate please check the sample data. — Rstudent, Apr 14 '20 at 16:09
Your `df2` has nothing outside of the `by=` columns. Are you looking for `inner_join(df1, df2, by = c("serial", "id1"), suffix = c(".df1", "df2"))`? — r2evans, Apr 14 '20 at 16:10
@r2vevans i receive 5884 observations some are duplicates because it does not merge on Day — Rstudent, Apr 14 '20 at 16:13
Your output is inconsistent: `serial` of 10 is only present in `df2` which does not have `id2`, but your output has an `id2` for those rows. Where do the 1s come from? — r2evans, Apr 14 '20 at 16:17
yes because in df1 an observation is defined by serial (group identifier) , id1 (person level data-defining the person from the group) , id2 (the day of the observation; 1 or 2 )and Day (weekday Mon1 till Sat 7) in df2 an observation is define just by serial id1 and Day. — Rstudent, Apr 14 '20 at 16:21
I suspect this is part-merge, part-aggregation, and you need to be clear what logic you intend to use. For instance: include all data in `df2` and matching data in `df1`, and missing `id2` data defaults to 1. — r2evans, Apr 14 '20 at 16:56
@GKi there is no need for NA, I think the NA is because I defined rthe df1 as df1<-0 df1$serial<-c(12, 12, 123, 123,123, 123, 10, 10, 10, 10, 10, 10) df1$id1<-c(1,1,1,1,2,2, 1, 1,2,2,3,3) df1$id2<-c(1,2,1,2,1,2,1,2,1,2,1,2) df1$Day<-c(1,6,3,7,2,7,4,7,2,7,4,7) so there is a a zero column that later i removes as df1=df1[-1] and created a data frame as df1<-as.data.frame(df1) — Rstudent, Apr 15 '20 at 07:03

4redwood · Accepted Answer · 2020-04-14T18:28:44.127

Based on the number of observations you're asking for, I'm guessing you want to keep all the observations from df1. You also said observations were uniquely identified by serial, id1, id2, and Day. I would try a left join:

left_join(df1,df2,by=c("serial","id1","Day"),suffix=c(".df1",".df2"))

Observations from df1 without a match in df2 will have NAs in the newly added columns.

A right join would lead to keep all the observations from df2 and get rid of the non-matching ones from df1, which seems like the opposite of what you're trying to do.

Per the documentation:

left_join()

return all rows from x, and all columns from x and y. Rows in x with no match in y will have NA values in the new columns. If there are multiple matches between x and y, all combinations of the matches are returned.

Basic question on merging two data frames based on different variables

1 Answers1