0

I have a quite interesting problem that I am looking a smart and possible efficient solution for.

I have a data frame that looks like this.

# A tibble: 6 x 6
  track_id   tag1   tag2  tag3   tag4   tag5
     <int>  <dbl>  <dbl> <dbl>  <dbl>  <dbl>
1  1550729  54087 109806 46869 183258  54337
2  1184201 201327   3668 46208 205245 189631
3  3763749 194264 194413  3424  91383 205245
4  2674608 198998 107401  2327   4425 107398
5  1999180  54087   4425 75574 239459   2397
6  3048820  11242 205245  2474  11056  72354

What I would like is to keep the track_id in the first row but to explode the tags, and where one track has the particular ID I would like to put a true value, namely one.

To be clearer, let's suppose I start from a smaller one:

  track_id   tag1   tag2 
1  1550729  54087 109806 
2  1184201 201327   3668 

After the transformation I would like to get at something like

  track_id   54087   109806  201327   3668   
1  1550729       1        1       0      0
2  1184201       0        0       1      1

Is this something possible quickly or I should roll out a solution by hand?

www
  • 38,575
  • 12
  • 48
  • 84
Siscia
  • 1,421
  • 1
  • 12
  • 29
  • Search for "dummy variable" and/or "one-hot encoding" – Hong Ooi Oct 02 '17 at 21:12
  • First convert the `tag` variables to factors and then `model.matrix(~ . + 0, data = your_data_frame)`. Suggested dupes: https://stackoverflow.com/q/11952706/903061, https://stackoverflow.com/q/24142576/903061 – Gregor Thomas Oct 02 '17 at 21:13

3 Answers3

6

A solution from dplyr and tidyr.

library(dplyr)
library(tidyr)

dt2 <- dt %>%
  gather(tag, value, -track_id) %>%
  select(-tag) %>%
  mutate(Occurrence = 1) %>%
  spread(value, Occurrence, fill = 0)

DATA

dt <- read.table(text = "  track_id   tag1   tag2  tag3   tag4   tag5
1  1550729  54087 109806 46869 183258  54337
2  1184201 201327   3668 46208 205245 189631
3  3763749 194264 194413  3424  91383 205245
4  2674608 198998 107401  2327   4425 107398
5  1999180  54087   4425 75574 239459   2397
6  3048820  11242 205245  2474  11056  72354",
                 header = TRUE)
www
  • 38,575
  • 12
  • 48
  • 84
4

Using melt() and dcast() from the data.table package, this becomes a "one-liner":

library(data.table)
melt(setDT(df), id.vars = "track_id")[, dcast(.SD, track_id ~ value, length)]
   track_id 2327 2397 2474 3424 3668 4425 11056 11242 46208 46869 54087 54337 72354 75574
1:  1184201    0    0    0    0    1    0     0     0     1     0     0     0     0     0
2:  1550729    0    0    0    0    0    0     0     0     0     1     1     1     0     0
3:  1999180    0    1    0    0    0    1     0     0     0     0     1     0     0     1
4:  2674608    1    0    0    0    0    1     0     0     0     0     0     0     0     0
5:  3048820    0    0    1    0    0    0     1     1     0     0     0     0     1     0
6:  3763749    0    0    0    1    0    0     0     0     0     0     0     0     0     0
   91383 107398 107401 109806 183258 189631 194264 194413 198998 201327 205245 239459
1:     0      0      0      0      0      1      0      0      0      1      1      0
2:     0      0      0      1      1      0      0      0      0      0      0      0
3:     0      0      0      0      0      0      0      0      0      0      0      1
4:     0      1      1      0      0      0      0      0      1      0      0      0
5:     0      0      0      0      0      0      0      0      0      0      1      0
6:     1      0      0      0      0      0      1      1      0      0      1      0
Uwe
  • 41,420
  • 11
  • 90
  • 134
3

Maybe something like the following.

dat <- read.table(text = "
track_id   tag1   tag2 
1  1550729  54087 109806 
2  1184201 201327   3668 
", header = TRUE)
dat

molten <- reshape2::melt(dat, id.vars = "track_id")
xtabs(~ track_id + value, molten)
#         value
#track_id  3668 54087 109806 201327
#  1184201    1     0      0      1
#  1550729    0     1      1      0
Rui Barradas
  • 70,273
  • 8
  • 34
  • 66