Merging two data frames with certain conditions in R

Question

I have two data frames:

df1
Syllable Duration Pitch
@         0.08    93
@         0.05    107
@         0.13    56
@         0.07    95
@         0.07    123

df2
Syllable Duration 
@        0.08 
@        0.05 
@        0.07
@        0.07

I want to merge them into another data frame:

df3
Syllable Duration Pitch
@        0.08     93
@        0.05     107
@        0.07     95
@        0.07     123

The problem is that I have repeated Syllable and Duration values. I've tried this code, but it gives me incorrect Pitch:

df3 <- merge(df2, df1[!duplicated(df1$Syllable),], by="Syllable")

df3
Syllable Duration Pitch
@        0.08     93
@        0.05     93
@        0.07     93
@        0.07     93

This is really somewhat difficult. How do you know which value pair is correct? Do you just want to keep the entries which are in df2 Syllable & Duration columns and filter out all values from df1 which are not in df2? — Cactus, Jun 05 '18 at 11:33
@rashid Yes, exactly. I want to filter out the rows that are not in df2. And the needed rows are going from top to bottom the way they are, so I guess it is somehow possible to do that. — user9895243, Jun 05 '18 at 11:38
Are you looking for an anti-join, i.e. exclude from `df1` those rows for which the combination of `Syllable` and `Duration` is not in `df2` ? — jogo, Jun 05 '18 at 11:40
@jogo Yes, and I've tried a default anti-join as well. But it gave me repeated values, too. — user9895243, Jun 05 '18 at 11:42
imho it is a inner join of `df1` and `unique(df2)` https://stackoverflow.com/questions/1299871/how-to-join-merge-data-frames-inner-outer-left-right — jogo, Jun 05 '18 at 11:56
related: https://stackoverflow.com/questions/50691572/comparing-two-data-frames-according-to-several-columns-simultaneously-in-r — jogo, Jun 05 '18 at 12:11

jogo · Answer 1 · 2018-06-08T13:00:35.567

With data.table you can do:

library("data.table")
df1 <- fread(
"Syllable Duration Pitch
@ 0.08 93
@ 0.05 107
@ 0.13 56
@ 0.07 95
@ 0.07 123")
df2 <- fread(
"Syllable Duration 
@ 0.08 
@ 0.05 
@ 0.07
@ 0.07")
merge(df1, unique(df2))
# > merge(df1, unique(df2))
#    Syllable Duration Pitch
# 1:        @     0.05   107
# 2:        @     0.07    95
# 3:        @     0.07   123
# 4:        @     0.08    93

or without sorting:

merge(df1, unique(df2), sort=FALSE)
# > merge(df1, unique(df2), sort=FALSE)
#    Syllable Duration Pitch
# 1:        @     0.08    93
# 2:        @     0.05   107
# 3:        @     0.07    95
# 4:        @     0.07   123

this last is the same as:

df1[unique(df2), on=c("Syllable", "Duration")]
# > df1[unique(df2), on=c("Syllable", "Duration")]
#    Syllable Duration Pitch
# 1:        @     0.08    93
# 2:        @     0.05   107
# 3:        @     0.07    95
# 4:        @     0.07   123

With base `R`:

df1 <- read.table(header=TRUE, text=
"Syllable Duration Pitch
@         0.08    93
@         0.05    107
@         0.13    56
@         0.07    95
@         0.07    123")

df2 <- read.table(header=TRUE, text=
"Syllable Duration 
@        0.08 
@        0.05 
@        0.07
@        0.07 ")
merge(df1, unique(df2))
merge(df1, unique(df2), sort=FALSE)

Cactus · Accepted Answer · 2018-06-05T11:51:10.820

I would propose using the dplyr package. If you use it, you can select the columns which you want to join by. When you join, you should use an semi_join instead of an inner_join. The difference is that inner_join keeps all combinations and possibly duplicates rows ("If there are multiple matches between x and y, all combination of the matches are returned.")

semi_joinon the other hand does: "A semi join differs from an inner join because an inner join will return one row of x for each matching row of y, where a semi join will never duplicate rows of x."

For your case, you can use semi_join(df1, df2, by = c("Syllable", "Duration"))to merge dataframes. The byvector defines the column names you want to join by.

This gives you what you wanted:

  Syllable Duration Pitch
1        @     0.08    93 
2        @     0.05   107
3        @     0.07    95
4        @     0.07   123

score 1 · Answer 3 · answered Jun 05 '18 at 12:27

1

#now keeps unique values for Syllable and the Pitch Values

df1 <- df1[order(df1$Syllable),]

df4<-merge(df2,df1)

df5<-df4[!duplicated(df4$Syllable),]

answered Jun 05 '18 at 12:27

Marta

27
7

I can not get the desired result with your code. Did you test it? – jogo Jun 05 '18 at 13:00
Yes, I tested using the dataframe defined in this question. For the Pitch values I used (1,0,1,0,1). And it worked. – Marta Jun 07 '18 at 15:52
Can you put the used data in your answer (the question gave only a print of the dataframes) and can you please use the same Pitch values as in the question? – jogo Jun 08 '18 at 07:24
I just did with your same dataframes. I get for df5, 3 rows,with the values are join the 2 dataframes (syllable :0.5,0.7,and 0.8) with their corresponding pithc values. For me it's working. if you can give me further details we may be able to correct the code. – Marta Jun 08 '18 at 12:20

Merging two data frames with certain conditions in R

3 Answers3

With base R:

With base `R`: