1

Trying to create a column in dataframe df1 based on match in another dataframe df2, where df1 is much bigger than df2:

df1$val2 <- df2$val2[match(df1$id, df2$IDs)]

This doesn't quite work because df2$IDs column is a list:

> df2
             IDs val2
1              0    1
2           1, 2    2
3           3, 4    3
4           5, 6    4
5           7, 8    5
6          9, 10    6
7 11, 12, 13, 14    7

It only works for the part where the list has 1 element (row 1: ..$ : int 0 above). For all other rows the 'match(df1$id, df2$IDs)' returns NA.

Test of matching some individual numbers works just fine with double brackets:

2 %in% df2[[2,'IDs']]

So, I either need to modify the column df2$IDs or need to perform match operation differently. The df1 has many other columns, so does the df2, but df2 is much shorter in rows.

The case can be reproduced with the following:

IDs <- c("[0]", "[1, 2]", "[3, 4]", "[5, 6]", "[7, 8]", "[9, 10]", "[11, 12, 13, 14]")
val2 <- c(1,2,3,4,5,6,7)
df2 <- data.frame(IDs, val2)
df2$IDs <- lapply(strsplit(as.character(df2$IDs), ','), function (x) as.integer(gsub("\\s|\\[|\\]", "", x)))
id <- floor(runif(100, min=0, max=15))
df1 <- data.frame(id)
str(df1)
str(df2)
df1$val2 <- df2$val2[match(df1$id, df2$IDs)]
Frank
  • 66,179
  • 8
  • 96
  • 180
Dimon
  • 436
  • 5
  • 15
  • 1
    A reproducible example could come handy... Have you tried unlist(df2$IDs)? – Cris Sep 05 '17 at 18:01
  • @Cris. Thank you. This would produce a flat list, but we need to keep appropriate list per row. For example in row 2 we have list: int 1 2. This tells us that all IDs with value 1 and 2 map to that row. – Dimon Sep 05 '17 at 18:06
  • You are talking about an example, but that's not the same as building and providing a *reproducible* example. See https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example/28481250#28481250 – Frank Sep 05 '17 at 18:52
  • 1
    Try `df2$IDs %>% modify(c(1, 1)) %>% flatten_int()`. I haven't tested it because I can't copy/paste your example, but what you essentially need is a way to index specific elements in a list, which you can do with `purrr::modify()`. You will need `library(dplyr)` `library(purrr)` for these commands. – Rich Pauloo Sep 05 '17 at 19:38
  • @Rich Pauloo. The df2$IDs appears to be a list of lists (please see edits above). The elements can be of different lengths, sometimes very large. I tried your suggestion but it returned: `[1] 0 1 3 5 7 9`. Looks like it only makes one vertical slice of it. – Dimon Sep 05 '17 at 20:27
  • 1
    @Frank. The reproducible example is posted – Dimon Sep 05 '17 at 21:34
  • @Cris. The reproducible example is posted – Dimon Sep 05 '17 at 21:35
  • @Rich Pauloo. The reproducible example is posted – Dimon Sep 05 '17 at 21:35
  • @Dimon, disregard this if you know it already, but if you find yourself working frequently with nested lists, it might be good to run through these tutorials. I found them super helpful even after I knew a thing or two about lists. https://jennybc.github.io/purrr-tutorial/index.html – Rich Pauloo Sep 05 '17 at 22:45
  • @Rich Pauloo. Thank you – Dimon Sep 05 '17 at 23:28

1 Answers1

2

List columns are clumsy to work with. If you convert df2 to a more vanilla format, it works:

DF2 = with(df2, data.frame(ID = unlist(IDs), val2 = rep(val2, lengths(IDs))))
df1$m = DF2$val2[ match(df1$id, DF2$ID) ]

If you want list columns just for browsing, it is quick to do...

aggregate(ID ~ ., DF2, list)

  val2             ID
1    1              0
2    2           1, 2
3    3           3, 4
4    4           5, 6
5    5           7, 8
6    6          9, 10
7    7 11, 12, 13, 14

.


Fyi, the match approach will not extend naturally to joining on more columns, so you might want to eventually learn data.table and its "update join" syntax for this case:

library(data.table)
setDT(df1); setDT(df2)

DT2 = df2[, .(ID = unlist(IDs)), by=setdiff(names(df2), "IDs")]
df1[DT2, on=.(id = ID), v := i.val2 ]
Frank
  • 66,179
  • 8
  • 96
  • 180
  • 1
    Thank you. That worked. Another way I was thinking is to do this by string matching (i.e. keeping the original IDs as strings). I haven't gotten it working. Besides,it is probably slower than your version, however, more generic one. – Dimon Sep 05 '17 at 23:32