9

Suppose I have a large data.table that looks like dt below.

dt <- data.table(
  player_1 = c("a", "b", "b", "c"),
  player_1_age = c(10, 20, 20, 30),
  player_2 = c("b", "a", "c", "a"),
  player_2_age = c(20, 10, 30, 10)
)
# dt
#    player_1 player_1_age player_2 player_2_age
# 1:        a           10        b           20
# 2:        b           20        a           10
# 3:        b           20        c           30
# 4:        c           30        a           10

From the dt above, I would like to create a data.table with unique players and their age like the following, player_dt:

# player_dt
# player  age
#      a   10
#      b   20
#      c   30

To do so, I've tried the code below, but it takes too long on my larger dataset, probably because I am creating a data.table for each iteration of sapply.

How would you get the player_dt above, while checking for each player that there is only one unique age value?

# get unique players
player <- sort(unique(c(dt$player_1, dt$player_2)))

# for each player, get their age, if there is only one age value
age <- sapply(player, function(x) {
  unique_values <- unique(c(
    dt[player_1 == x][["player_1_age"]],
    dt[player_2 == x][["player_2_age"]]))
  if(length(unique_values) > 1) stop() else return(unique_values)
})

# combine to create the player_dt
player_dt <- data.table(player, age)
johnny
  • 571
  • 4
  • 14
  • 1
    you are probalby looking for some form of pivot longer – user12256545 May 02 '20 at 17:55
  • 1
    Can you show the expected output if there would be, say, `b = 15` for `player_2` or a single `d = 5` for some other player? – markus May 02 '20 at 17:56
  • Thanks @markus! If `b = 15` or `d = 5`, I would like for the function to stop, so I can correct the error in the larger dataset. In the larger dataset, each player **should** be matched with only one age value. Basically, I'd like to find a way to identify such errors. Do you think maybe I should do this in two steps, (1) identify errors in the larger dataset; then (2) create the unique player data, `player_dt`? – johnny May 02 '20 at 18:04
  • 1
    @user12256545 Could you elaborate on what do you mean by pivot "longer"? – johnny May 02 '20 at 18:24
  • 1
    Pivot longer is some new name given to pivot or unpivot (SQL terms) or melt or cast (reshape terms). – jangorecki May 02 '20 at 20:59

2 Answers2

7

I use the data from @DavidT as input.

dt
#   player_1 player_1_age player_2 player_2_age
#1:        a           10        b           20
#2:        b           20        a           10
#3:        b           20        c           30
#4:        c           30        a           11 # <--

TL;DR

You can do

nm <- names(dt)
idx <- endsWith(nm, "age")
colsAge <- nm[idx]
colsOther <- nm[!idx]

out <-
  unique(melt(
    dt,
    measure.vars = list(colsAge, colsOther),
    value.name = c("age", "player")
  )[, .(age, player)])[, if (.N == 1) # credit: https://stackoverflow.com/a/34427944/8583393
    .SD, by = player]
out
#   player age
#1:      b  20
#2:      c  30

Step-by-step

What you can to do is to melt multiple columns simultaneously - those that end with "age" and those that don't.

nm <- names(dt)
idx <- endsWith(nm, "age")
colsAge <- nm[idx]
colsOther <- nm[!idx]
dt1 <- melt(dt, measure.vars = list(colsAge, colsOther), value.name = c("age", "player"))

The result is

dt1
#   variable age player
#1:        1  10      a
#2:        1  20      b
#3:        1  20      b
#4:        1  30      c
#5:        2  20      b
#6:        2  10      a
#7:        2  30      c
#8:        2  11      a

Now we call unique ...

out <- unique(dt1[, .(age, player)])
out
#   age player
#1:  10      a
#2:  20      b
#3:  30      c
#4:  11      a

... and filter for groups of "player" with length equal to 1

out <- out[, if(.N == 1) .SD, by=player]
out
#   player age
#1:      b  20
#2:      c  30

Given OP's input data, that last step is not needed.

data

library(data.table)
dt <- data.table(
  player_1 = c("a", "b", "b", "c"),
  player_1_age = c(10, 20, 20, 30),
  player_2 = c("b", "a", "c", "a"),
  player_2_age = c(20, 10, 30, 11)
)

Reference: https://cran.r-project.org/web/packages/data.table/vignettes/datatable-reshape.html

markus
  • 25,843
  • 5
  • 39
  • 58
  • I learned so much from your answer. Thank you! With `if (.N == 1) .SD` you are supposed to get two **subsetS** of data.table whose number of observations = 1, right? Do you know why the two subsets are combined to form a two-row data.table, i.e., your `out`, as opposed to two one-row data.tables? That is, why don't we get row 1 of your `out` and row 2 of your `out` separately? Just curious how the output from `.SD` works when you use `by` – johnny May 02 '20 at 20:02
  • 1
    @johnc Glad to hear that answer was helpful for you. Regarding the `.SD`, this answer might be helpful for you: https://stackoverflow.com/a/8509301/8583393 – markus May 03 '20 at 12:06
2

I've altered your data so that there's at least one error to catch:

library(tidyverse)

dt <- tibble(
  player_1 = c("a", "b", "b", "c"),
  player_1_age = c(10, 20, 20, 30),
  player_2 = c("b", "a", "c", "a"),
  player_2_age = c(20, 10, 30, 11)
)
  # Get the Names columns and the Age columns
colName <- names(dt)
ageCol <- colName[str_detect(colName, "age$")]
playrCol <- colName[! str_detect(colName, "age$")]

  # Gather the Ages
ages <- dt %>% 
  select(ageCol) %>% 
  gather(player_age, age)

  # Gather the names
names <- dt %>% 
  select(playrCol ) %>% 
  gather(player_name, name)

  # Bind the two together, and throw out the duplicates
  # If there are no contradictions, this is what you want.
allNameAge <- cbind( names, ages) %>% 
  select(name, age) %>% 
  distinct() %>% 
  arrange(name)

  # But check for inconsistencies.  This should leave you with
  # an empty tibble, but instead it shows the error.
inconsistencies <- allNameAge %>% 
  group_by(name) %>% 
  mutate(AGE.COUNT = n_distinct(age)) %>% 
  filter(AGE.COUNT > 1) %>% 
  ungroup()

This should extends to more name/age column pairs.

David T
  • 1,993
  • 10
  • 18