2

Hello I need to merge two data frames. The relevant ID columns of the two data frames contain comma-separated lists of IDs (as string).

For example those two data frames should be merged:

ID             value   
"abc,def,ghi"  5
"jkl"          8
"stu,xyz"      2    


ID             value
"def,xyz"      7
"klm"          6
"opq,stu"      9

My matching criterion would be positive if any of the individual ids in one ID-list is also present in the other. So the merged data frame should look like (for a left-join like merge):

ID             value.Data.frame.1   value.Data.frame.2
"def,xyz"      5                    7
"jkl"          8                    NA
"stu,xyz"      2                    9
"stu,xyz"      2                    7              

(My real data actually has more than one value column in the data frames.) Usually for the merge I would do something like:

 merge(data.frame1, data.frame2, by=ID, all.x=TRUE)

in which case I only merge rows with exact string equality in the ID columns.

Is it possible to override the match criterion in the merge function somehow? Or is there another efficient way to achieve the same goal in R?

datamole
  • 155
  • 7
  • Please provide a reproducible example http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example – Steven Beaupré Mar 20 '15 at 18:02
  • Can you show the expected result also based on the example provided – akrun Mar 20 '15 at 18:02
  • I've improved the example, hope it's more clear now :-) – datamole Mar 23 '15 at 17:26
  • I'm not sure about your example because `"def,xyz" ` present both in the first row of `data.frame1` and in the third. – David Arenburg Mar 24 '15 at 14:31
  • Other than that, this seem to work on the provided data sets, but I'm not sure about the real ones `indx <- sapply(gsub(",", "|", df2$ID), function(x) any(grepl(x, df1$ID))); library(data.table); setDT(df1)[indx, value2 := df2$value[indx]][]` – David Arenburg Mar 24 '15 at 14:37
  • Hi David, the ambiguity that lines 1 and 3 of data frame one both match line 1 of data frame 2 is on purpose. This might happen. I've modified the desired outcome to make clear how it should be dealt with (I left that open in the beginning in order to be not too restrictive with respect to possible solutions to the more general problem). – datamole Mar 25 '15 at 15:53
  • @David Arenburg: Your suggested code seems to be truncated at the end, so I couldn't really try it. Perhaps just make it an "answer" .-) – datamole Mar 26 '15 at 10:51
  • My suggested code isn't truncated at the end. You can just copy/paste it and try it. Either way, it won't work correctly with your new desired output. – David Arenburg Mar 26 '15 at 11:08
  • @David Arenburg: `setDT(df1)[indx, value2 := df2$value[indx]][]` gives me an error ... – datamole Mar 26 '15 at 12:32
  • What error it gave you? – David Arenburg Mar 26 '15 at 12:33
  • `Error in `[.data.table`(setDT(df1), idx, `:=`(value2, df2$value[idx])) :When deleting columns, i should not be provided` – datamole Mar 26 '15 at 12:47

0 Answers0