Use look-up table in R to create new column based on another df

Question

I'm having a hard time understanding how to do the equivalent of a look-up table in R. I've seen people suggest you should use "merges" in lieu of look-up tables, but I'm not sure what the right approach is:

Let's say I have the following:

set.seed(42)
person_ids <- data.frame(person_1_id = stringi::stri_rand_strings(100, 10, '[A-Z]'), 
                 person_2_id = stringi::stri_rand_strings(100, 10, '[A-Z]'))

team_id_lookup <- data.frame(person_id = stringi::stri_rand_strings(100, 10, '[A-Z]'), 
                  team_ids = floor(runif(100, min=0, max=500)))

I want to create two new columns in person_ids -- team_id_1 and team_id_2, which use the look up dataframe to find the corresponding team_ids for a given person_id and taking that value.

What is the right approach here?

One thought is that you could use the `hash` package, and also you could consider a named vector rather than a data frame for your look up values. Or even a named list. Then you can set the two team_id values based on names with matching person_id. — Elin, Aug 04 '18 at 18:49
Seems that there are no matches due to your random strings generation process. I'll try to post a similar but simpler example. Hope it helps.... — AntoniosK, Aug 04 '18 at 18:52
Don't you need 200 names in the look up table and to have those names match the names in the person table? — Elin, Aug 04 '18 at 18:58
I do want to point out that with names consisting of 10 randomly selected letters you're not likely to get duplicate names but in the real world you are. Therefore you will need to introduce some kind of duplicate checking prior to doing any join. In that sense giving each person a random id like this is possibly useful. — Elin, Aug 04 '18 at 19:16

AntoniosK · Accepted Answer · 2018-08-04T19:01:00.650

4

set.seed(42)
person_ids <- data.frame(person_1_id = stringi::stri_rand_strings(10, 1, '[A-Z]'), 
                         person_2_id = stringi::stri_rand_strings(10, 1, '[A-Z]'))

team_id_lookup <- data.frame(person_id = stringi::stri_rand_strings(5, 1, '[A-Z]'), 
                             team_ids = floor(runif(5, min=0, max=500)))

library(dplyr)

person_ids %>%
  left_join(team_id_lookup, by=c("person_1_id"="person_id")) %>%
  left_join(team_id_lookup, by=c("person_2_id"="person_id")) %>%
  rename(team_id_1 = team_ids.x,
         team_id_2 = team_ids.y)

#    person_1_id person_2_id team_id_1 team_id_2
# 1            X           L       257        NA
# 2            Y           S       223        NA
# 3            H           Y        NA       223
# 4            V           G        NA        NA
# 5            Q           M        NA        NA
# 6            N           Y        NA       223
# 7            T           Z        NA       452
# 8            D           D       195       195
# 9            R           M        NA        NA
# 10           S           O        NA        NA

Only people found in your look up table will have a match. All other will have NA.

edited Aug 04 '18 at 19:01

answered Aug 04 '18 at 18:53

AntoniosK

15,991
2
19
32

This is exactly what I was looking for. Thank you! By the way, if my team_id_lookup column contains more columns than I don't want in the final dataframe, is there a quick way to subset team_id_lookup in the piping chain? I tried doing a subset command, and dplyr doesn't like it – Parseltongue Aug 04 '18 at 18:56
If you want to ignore some if your columns try the `select` command from `dplyr` package. Use `select(x,y)` to keep `x,y` or use `select(-x,-y)` to remove `x,y`. – AntoniosK Aug 04 '18 at 18:59
Also note that you'll get some warning because your datasets have `factor` varaibles. You can update them to `character` variables in advance. – AntoniosK Aug 04 '18 at 19:02
Using one letter names for 100 rows is going to make the names not unique however, just saying, the theory is fine. However I think with 100 rows you'll very likely get almost all the letters at least once. – Elin Aug 04 '18 at 19:14
That's why I used less rows. To make sure I don't get a single letter with multiple values in the look up table. The OP has the opposite problem. He generated so unique values that there were not matches :) – AntoniosK Aug 04 '18 at 19:15
Well OP needlessly generated random variables for names for the look up table. – Elin Aug 04 '18 at 19:17
Indeed. I think he should have created an example with at least some matches and also should have provided he's ideal output. – AntoniosK Aug 04 '18 at 19:20
1

Correct we don't really know if so many NAs are expected and also if it's okay to have the same names in both columns or duplicate names in the same column. Even with n=10 theres a birthday problem issue. – Elin Aug 04 '18 at 19:26

coffeinjunky · Answer 2 · 2018-08-04T19:20:59.947

2

Just to point a base R solution out here:

The idea of a look-up table is that there is a table where you can easily look up information based on the index. The index in R would be e.g. the row name of a table, which you can assign for instance with:

rownames(team_id_lookup) <- team_id_lookup$person_id

You can then use this to lookup your team memberships:

person_ids$team1 <- team_id_lookup[person_ids$person_1_id,"team_ids"]
person_ids$team2 <- team_id_lookup[person_ids$person_2_id,"team_ids"]
person_ids
   person_1_id person_2_id team1 team2
1            X           L   257    NA
2            Y           S   223    NA
3            H           Y    NA   223
4            V           G    NA    NA
5            Q           M    NA    NA
6            N           Y    NA   223
7            T           Z    NA   452
8            D           D   195   195
9            R           M    NA    NA
10           S           O    NA    NA

For reproducability: I used the same data as @AntoniosK but set stringsAsFactors to FALSE, as per Antonios comment.

set.seed(42)
person_ids <- data.frame(
        person_1_id = stringi::stri_rand_strings(10, 1, '[A-Z]'), 
        person_2_id = stringi::stri_rand_strings(10, 1, '[A-Z]'), 
        stringsAsFactors = F)
team_id_lookup <- data.frame(
        person_id = stringi::stri_rand_strings(5, 1, '[A-Z]'), 
        team_ids = floor(runif(5, min=0, max=500)), 
        stringsAsFactors = F)

edited Aug 04 '18 at 19:20

answered Aug 04 '18 at 19:01

coffeinjunky

11,254
39
57

Ah, this is so clean! Thanks a lot – Parseltongue Aug 04 '18 at 19:05
Yep, `base` R can be extremely elegant sometimes. :) – coffeinjunky Aug 04 '18 at 19:07
Not sure it matters, but I'm getting the warning: `Setting row names on a tibble is deprecated.` . Any idea what the non-deprecated strategy is? – Parseltongue Aug 04 '18 at 19:07
Don't use a tibble then. Convert it to a `data.frame` with `as.data.frame()`. – coffeinjunky Aug 04 '18 at 19:07
@coffeinjunky I assume you're using the same data as me, but your result is different. It seems that this approach has a problem with the `factor` variables as it uses the numbers the correspond to the factor levels. Have a quick double check and you'll spot it – AntoniosK Aug 04 '18 at 19:09
StringsAsFactors = FALSE is missing from the OP's code. – Elin Aug 04 '18 at 19:10
Well spotted @AntoniosK You are right. Setting stringsAsFactors = F solves the issue. Thanks for pointing it out! – coffeinjunky Aug 04 '18 at 19:11
@Elin 's answer is ok with `stringsAsFactors = F` because she doesn't use a look up data frame, but a look up vector. – AntoniosK Aug 04 '18 at 19:18
No you want the names as characters no matter what method you use. List, vector or data frame. – Elin Aug 04 '18 at 19:19
We get the same answers now @AntoniosK with `stringsAsFactors = F`. – coffeinjunky Aug 04 '18 at 19:22
Needless to say, `merge`ing is typically preferred. Weird things might happen if some team ids are NAs or if they are duplicate. – coffeinjunky Aug 04 '18 at 19:24
But that was fun. Thanks for the comments @AntoniosK and @Elin! – coffeinjunky Aug 04 '18 at 19:24

score 2 · Answer 3 · answered Aug 04 '18 at 19:06

I'm not completely following the logic of having only 100 team values so I'm doing 200. But this is another approach.

set.seed(42)
person_ids <- data.frame(person_1_id = stringi::stri_rand_strings(100, 10, '[A-Z]'), 
                         person_2_id = stringi::stri_rand_strings(100, 10, '[A-Z]'), 
                                        stringsAsFactors = FALSE)

all_pid <- c(person_ids$person_1_id, person_ids$person_2_id)

team_ids <- floor(runif(200, min=0, max=500))
names(team_ids) <- all_pid

person_ids$team_id_1 <- team_ids[person_ids$person_1_id]
person_ids$team_id_2 <- team_ids[person_ids$person_2_id]
head(person_ids)

  person_1_id person_2_id team_id_1 team_id_2
1  XYHVQNTDRS  WBVOMAOSGV       128       207
2  LSYGMYZDMO  KODFEVCGUH       362       422
3  XDZYCNKXLV  WBIIDJMBZX        78       428
4  TVKRAVAFXP  ZEBDJOYQMC       157       225
5  JLAZLYXQZQ  YJJQSFPVZA       148       366
6  IJKUBTREGN  VFCRHVQNAH       339       337

Use look-up table in R to create new column based on another df

3 Answers3

Linked