0

I have 2 dataframes like this

ID <- c("A","B","C")
Type <- c("PASS","PASS","FAIL")
Measurement <- c("Length","Height","Breadth")
Function <- c("Volume","Area","Circumference")
df1 <- data.frame(ID,Type,Measurement,Function)

ID <- c("A","B","C","C")
Type <- c("PASS","PASS","FAIL","FAIL")
Measurement <- c("Length","Height","Breadth","Breadth_DSPT")
df2 <- data.frame(ID,Type,Measurement)

I am trying to merge these 2 data frames in a way that it returns matching measurements and also returns rows that have the matching measurement concatenated by another string.

My desired output is

  ID Type  Measurement      Function
   A PASS       Length        Volume
   B PASS       Height          Area
   C FAIL      Breadth Circumference
   C FAIL Breadth_DSPT Circumference

I use the merge function like this to get the first 3 rows but how do we match the measurement names in the data frames to return all rows that have the match?

df <- merge(df1,df2,by=c("ID","Type","Measurement"),all.x=T)
Sharath
  • 2,225
  • 3
  • 24
  • 37
  • Seems like you're trying to merge data frames based on partial matching? See if this helps: http://stackoverflow.com/questions/10617377/merge-data-with-partial-match-in-r – Daniel Anderson Mar 06 '17 at 20:26
  • If I understand it properly, you need to use fuzzy join: https://cran.r-project.org/web/packages/fuzzyjoin/fuzzyjoin.pdf, since "Breadth" and "Breadth_DSPT" are similar, but not the same – Mislav Mar 06 '17 at 22:13
  • Is the concatenation always just "string_newpart" ? – thelatemail Mar 06 '17 at 22:27

3 Answers3

3

One way to achieve it is to use the sqldf package:

library(sqldf)

sqldf("select df1.ID, df1.Type, df2.Measurement, df1.Function
      from df1 left join df2 on (df1.ID = df2.ID and 
                                 df1.Type = df2.Type and 
                                 df2.Measurement like df1.Measurement||'%')")

#   ID Type  Measurement      Function
# 1  A PASS       Length        Volume
# 2  B PASS       Height          Area
# 3  C FAIL      Breadth Circumference
# 4  C FAIL Breadth_DSPT Circumference

The last clause in the join (df2.Measurement like df1.Measurement||'%') means that df2$Measurement must be equal to df1$Measurement followed by any character string, but you can specify a more flexible condition using SQL's % and _.

Scarabee
  • 5,437
  • 5
  • 29
  • 55
2

If you simply have a concatenation on the end of the string, you can just do something like:

merge(
  transform(df2, tmpmeas = sub("_.+$", "", Measurement)),
  df1,
  by.x=c("ID","Type","tmpmeas"), by.y=c("ID","Type","Measurement")
)[-3]
#  ID Type  Measurement      Function
#1  A PASS       Length        Volume
#2  B PASS       Height          Area
#3  C FAIL      Breadth Circumference
#4  C FAIL Breadth_DSPT Circumference
thelatemail
  • 91,185
  • 12
  • 128
  • 188
-1

You can use data.table library to do that. First convert your dataframe to datatable, set the key for each table using setkey, then merge.

dt1 <- data.table(df1)
dt2 <- data.table(df2)
setkey(dt1,ID)
setkey(dt2,ID)
merge(dt1,dt2)

#    ID Type.x Measurement.x      Function Type.y Measurement.y
# 1:  A   PASS        Length        Volume   PASS        Length
# 2:  B   PASS        Height          Area   PASS        Height
# 3:  C   FAIL       Breadth Circumference   FAIL       Breadth
# 4:  C   FAIL       Breadth Circumference   FAIL  Breadth_DSPT
Fadwa
  • 1,717
  • 5
  • 26
  • 43