What is the "data table" way of doing this join/merge?

Question

I have a "dictionary" table like this:

dict <- data.table(
  Nickname = c("Abby", "Ben", "Chris", "Dan", "Ed"),
  Name = c("Abigail", "Benjamin", "Christopher", "Daniel", "Edward")
)
dict
#    Nickname        Name
# 1:     Abby     Abigail
# 2:      Ben    Benjamin
# 3:    Chris Christopher
# 4:      Dan      Daniel
# 5:       Ed      Edward

And a "data" table like this:

dat <- data.table(
  Friend1 = c("Abby", "Ben", "Ben", "Chris"),
  Friend2 = c("Ben", "Ed", NA, "Ed"),
  Friend3 = c("Ed", NA, NA, "Dan"),
  Friend4 = c("Dan", NA, NA, NA)
)
dat
#    Friend1 Friend2 Friend3 Friend4
# 1:    Abby     Ben      Ed     Dan
# 2:     Ben      Ed      NA      NA
# 3:     Ben      NA      NA      NA
# 4:   Chris      Ed     Dan      NA

I would like to produce a data.table that looks like this

result <- data.table(
  Friend1.Nickname = c("Abby", "Ben", "Ben", "Chris"),
  Friend1.Name = c("Abigail", "Benjamin", "Benjamin", "Christopher"),
  Friend2.Nickname = c("Ben", "Ed", NA, "Ed"),
  Friend2.Name = c("Benjamin", "Edward", NA, "Edward"),
  Friend3.Nickname = c("Ed", NA, NA, "Dan"),
  Friend3.Name = c("Edward", NA, NA, "Daniel"),
  Friend4.Nickname = c("Dan", NA, NA, NA),
  Friend4.Name = c("Daniel", NA, NA, NA)
)
result
# sorry, word wrapping makes this too annoying to copy

And this is the solution I had in mind:

friend_vars <- paste0("Friend", 1:4)
friend_nicks <- paste0(friend_vars, ".Nickname")
friend_names <- paste0(friend_vars, ".Name")
setnames(dat, friend_vars, friend_nicks)
for (i in 1:4) {
  dat[, friend_names[i] := dict$Name[match(dat[[friend_nicks[i]]], dict$Nickname)], with = FALSE]
}

Is there a more "data-table-esque" way to do this? I'm sure it's nice and efficient, but it's ugly to read, and part from data.table's in-place assignment I don't feel like I'm taking good advantage of what the package has to offer.

I'm also not a very strong SQL user, and I'm not too comfortable with join terminology. I have a feeling that Data.table - left outer join on multiple tables could be useful here but I'm not sure how to apply it to my situation.

score 6 · Accepted Answer · answered Aug 31 '15 at 19:09

6

Using data.table 1.9.5:

for (nm in names(dat)) {
    on = setattr("Nickname", 'names', nm)
    dat[dict, paste0(nm, ".Name") := i.Name, on=on]
}

We can join using on= instead of setting keys. Now you can use setcolorder() to reorder the names.

I avoid reshaping data unless absolutely necessary. This is where update while join comes in really handy. And now with the on= argument, I couldn't resist posting an answer :-).

answered Aug 31 '15 at 19:09

Arun

116,683
26
284
387

I upgraded to 1.9.5 just for this, and I'm happy I did. The `on=` syntax is great! By the way, if you knew in advance that you were were going to switch from wide format to long format, would it in general be more efficient to merge first and melt second, or melt first and merge second? – shadowtalker Aug 31 '15 at 19:36
Great! My guess would be that melting + joining would be faster (since you need to join only once).. and melt is *usually* inexpensive operation in terms of speed (if implemented properly). – Arun Aug 31 '15 at 19:40
1

Also: where is the `i.*` syntax documented? I couldn't find it in `[.data.table` or `:=` – shadowtalker Aug 31 '15 at 19:56
Also also: can you comment on the issue I raised in the comments on Chris' answer? – shadowtalker Aug 31 '15 at 20:23
1

@ssdecontrol It's been two years since I asked about the documentation for `i`, but yeah it still seems to be missing... http://stackoverflow.com/questions/16843728/accessing-y-columns-with-duplicated-names-in-j-of-xy-j-merges#comment24289062_16843833 – Frank Aug 31 '15 at 20:45
2

@ssdecontrol see [#1038](https://github.com/Rdatatable/data.table/issues/1038). You're free to issue PR. I don't have plans to work on it anytime soon. – Arun Aug 31 '15 at 21:02

score 2 · Answer 2 · answered Aug 31 '15 at 17:51

2

I didn't come up w/ a solution that matches exactly your result, but you might be able to work w/ something like this:

dat[, id := .I]
dat.m <- melt(dat, id.vars='id', variable.name='Friend', value.name='Nickname')
setkey(dict, Nickname)
dat.m[, Name := dict[Nickname, Name]]
> dat.m
    id  Friend Nickname        Name
 1:  1 Friend1     Abby     Abigail
 2:  2 Friend1      Ben    Benjamin
 3:  3 Friend1      Ben    Benjamin
 4:  4 Friend1    Chris Christopher
 5:  1 Friend2      Ben    Benjamin
 6:  2 Friend2       Ed      Edward
 7:  3 Friend2       NA          NA
 8:  4 Friend2       Ed      Edward
 9:  1 Friend3       Ed      Edward
10:  2 Friend3       NA          NA
11:  3 Friend3       NA          NA
12:  4 Friend3      Dan      Daniel
13:  1 Friend4      Dan      Daniel
14:  2 Friend4       NA          NA
15:  3 Friend4       NA          NA
16:  4 Friend4       NA          NA

The variable id was just a placeholder so I could melt the DT.

answered Aug 31 '15 at 17:51

Chris Watson

1,347
1
9
24

1

Yeah, I think long format is better. The standard way to do that merge is `setkey(dat.m, Nickname); dat.m[dict, Name := i.Name]`. The OP could also revert to wide format with `dcast(dat.m, id ~ Friend, value.var=c("Name","Nickname"))`, though it still doesn't look exactly like their desired output. – Frank Aug 31 '15 at 18:01
@Frank that answer doesn't work very well if there are unused elements in `dict`. Consider `dict <- rbind(dict, data.table(Nickname = "Fran", Name = "Francesca"))`. You get extra rows in the result corresponding to the unused rows in `dict`. – shadowtalker Aug 31 '15 at 20:22
@ssdecontrol `dat.m[dict]` will have more rows, yes, but `dat.m` cannot get additional rows that way. If you try running my code and then run `dat.m` afterwards, you'll see that it just adds a column to the table -- such merges cannot add rows. – Frank Aug 31 '15 at 20:43

jlhoward · Answer 3 · 2015-08-31T18:23:15.530

setkey(dict,Nickname)
dat[,paste(names(dat),"Name",sep="."):=lapply(.SD,function(x)dict[J(x)]$Name)]
setcolorder(dat,c(1,5,2,6,3,7,4,8))
dat
#    Friend1 Friend1.Name Friend2 Friend2.Name Friend3 Friend3.Name Friend4 Friend4.Name
# 1:    Abby      Abigail     Ben     Benjamin      Ed       Edward     Dan       Daniel
# 2:     Ben     Benjamin      Ed       Edward      NA           NA      NA           NA
# 3:     Ben     Benjamin      NA           NA      NA           NA      NA           NA
# 4:   Chris  Christopher      Ed       Edward     Dan       Daniel      NA           NA

jeremycg · Answer 4 · 2015-08-31T19:13:17.273

in base, super ugly:

cbind(dat, lapply(dat, function(x){dict$Name[match(x, dict$Nickname)]}))

   Friend1 Friend2 Friend3 Friend4          V2       NA     NA     NA
1:    Abby     Ben      Ed     Dan     Abigail Benjamin Edward Daniel
2:     Ben      Ed      NA      NA    Benjamin   Edward     NA     NA
3:     Ben      NA      NA      NA    Benjamin       NA     NA     NA
4:   Chris      Ed     Dan      NA Christopher   Edward Daniel     NA

What is the "data table" way of doing this join/merge?

4 Answers4