2

I'm having a hard time understanding how to do the equivalent of a look-up table in R. I've seen people suggest you should use "merges" in lieu of look-up tables, but I'm not sure what the right approach is:

Let's say I have the following:

set.seed(42)
person_ids <- data.frame(person_1_id = stringi::stri_rand_strings(100, 10, '[A-Z]'), 
                 person_2_id = stringi::stri_rand_strings(100, 10, '[A-Z]'))

team_id_lookup <- data.frame(person_id = stringi::stri_rand_strings(100, 10, '[A-Z]'), 
                  team_ids = floor(runif(100, min=0, max=500)))

I want to create two new columns in person_ids -- team_id_1 and team_id_2, which use the look up dataframe to find the corresponding team_ids for a given person_id and taking that value.

What is the right approach here?

Parseltongue
  • 11,157
  • 30
  • 95
  • 160
  • One thought is that you could use the `hash` package, and also you could consider a named vector rather than a data frame for your look up values. Or even a named list. Then you can set the two team_id values based on names with matching person_id. – Elin Aug 04 '18 at 18:49
  • Seems that there are no matches due to your random strings generation process. I'll try to post a similar but simpler example. Hope it helps.... – AntoniosK Aug 04 '18 at 18:52
  • Don't you need 200 names in the look up table and to have those names match the names in the person table? – Elin Aug 04 '18 at 18:58
  • I do want to point out that with names consisting of 10 randomly selected letters you're not likely to get duplicate names but in the real world you are. Therefore you will need to introduce some kind of duplicate checking prior to doing any join. In that sense giving each person a random id like this is possibly useful. – Elin Aug 04 '18 at 19:16

3 Answers3

4
set.seed(42)
person_ids <- data.frame(person_1_id = stringi::stri_rand_strings(10, 1, '[A-Z]'), 
                         person_2_id = stringi::stri_rand_strings(10, 1, '[A-Z]'))

team_id_lookup <- data.frame(person_id = stringi::stri_rand_strings(5, 1, '[A-Z]'), 
                             team_ids = floor(runif(5, min=0, max=500)))

library(dplyr)

person_ids %>%
  left_join(team_id_lookup, by=c("person_1_id"="person_id")) %>%
  left_join(team_id_lookup, by=c("person_2_id"="person_id")) %>%
  rename(team_id_1 = team_ids.x,
         team_id_2 = team_ids.y)

#    person_1_id person_2_id team_id_1 team_id_2
# 1            X           L       257        NA
# 2            Y           S       223        NA
# 3            H           Y        NA       223
# 4            V           G        NA        NA
# 5            Q           M        NA        NA
# 6            N           Y        NA       223
# 7            T           Z        NA       452
# 8            D           D       195       195
# 9            R           M        NA        NA
# 10           S           O        NA        NA

Only people found in your look up table will have a match. All other will have NA.

AntoniosK
  • 15,991
  • 2
  • 19
  • 32
  • This is exactly what I was looking for. Thank you! By the way, if my team_id_lookup column contains more columns than I don't want in the final dataframe, is there a quick way to subset team_id_lookup in the piping chain? I tried doing a subset command, and dplyr doesn't like it – Parseltongue Aug 04 '18 at 18:56
  • If you want to ignore some if your columns try the `select` command from `dplyr` package. Use `select(x,y)` to keep `x,y` or use `select(-x,-y)` to remove `x,y`. – AntoniosK Aug 04 '18 at 18:59
  • Also note that you'll get some warning because your datasets have `factor` varaibles. You can update them to `character` variables in advance. – AntoniosK Aug 04 '18 at 19:02
  • Using one letter names for 100 rows is going to make the names not unique however, just saying, the theory is fine. However I think with 100 rows you'll very likely get almost all the letters at least once. – Elin Aug 04 '18 at 19:14
  • That's why I used less rows. To make sure I don't get a single letter with multiple values in the look up table. The OP has the opposite problem. He generated so unique values that there were not matches :) – AntoniosK Aug 04 '18 at 19:15
  • Well OP needlessly generated random variables for names for the look up table. – Elin Aug 04 '18 at 19:17
  • Indeed. I think he should have created an example with at least some matches and also should have provided he's ideal output. – AntoniosK Aug 04 '18 at 19:20
  • 1
    Correct we don't really know if so many NAs are expected and also if it's okay to have the same names in both columns or duplicate names in the same column. Even with n=10 theres a birthday problem issue. – Elin Aug 04 '18 at 19:26
2

Just to point a base R solution out here:

The idea of a look-up table is that there is a table where you can easily look up information based on the index. The index in R would be e.g. the row name of a table, which you can assign for instance with:

rownames(team_id_lookup) <- team_id_lookup$person_id

You can then use this to lookup your team memberships:

person_ids$team1 <- team_id_lookup[person_ids$person_1_id,"team_ids"]
person_ids$team2 <- team_id_lookup[person_ids$person_2_id,"team_ids"]
person_ids
   person_1_id person_2_id team1 team2
1            X           L   257    NA
2            Y           S   223    NA
3            H           Y    NA   223
4            V           G    NA    NA
5            Q           M    NA    NA
6            N           Y    NA   223
7            T           Z    NA   452
8            D           D   195   195
9            R           M    NA    NA
10           S           O    NA    NA

For reproducability: I used the same data as @AntoniosK but set stringsAsFactors to FALSE, as per Antonios comment.

set.seed(42)
person_ids <- data.frame(
        person_1_id = stringi::stri_rand_strings(10, 1, '[A-Z]'), 
        person_2_id = stringi::stri_rand_strings(10, 1, '[A-Z]'), 
        stringsAsFactors = F)
team_id_lookup <- data.frame(
        person_id = stringi::stri_rand_strings(5, 1, '[A-Z]'), 
        team_ids = floor(runif(5, min=0, max=500)), 
        stringsAsFactors = F)
coffeinjunky
  • 11,254
  • 39
  • 57
2

I'm not completely following the logic of having only 100 team values so I'm doing 200. But this is another approach.

set.seed(42)
person_ids <- data.frame(person_1_id = stringi::stri_rand_strings(100, 10, '[A-Z]'), 
                         person_2_id = stringi::stri_rand_strings(100, 10, '[A-Z]'), 
                                        stringsAsFactors = FALSE)

all_pid <- c(person_ids$person_1_id, person_ids$person_2_id)

team_ids <- floor(runif(200, min=0, max=500))
names(team_ids) <- all_pid

person_ids$team_id_1 <- team_ids[person_ids$person_1_id]
person_ids$team_id_2 <- team_ids[person_ids$person_2_id]
head(person_ids)

  person_1_id person_2_id team_id_1 team_id_2
1  XYHVQNTDRS  WBVOMAOSGV       128       207
2  LSYGMYZDMO  KODFEVCGUH       362       422
3  XDZYCNKXLV  WBIIDJMBZX        78       428
4  TVKRAVAFXP  ZEBDJOYQMC       157       225
5  JLAZLYXQZQ  YJJQSFPVZA       148       366
6  IJKUBTREGN  VFCRHVQNAH       339       337 
Elin
  • 6,507
  • 3
  • 25
  • 47