I have a data frame in which each id can have multiple event types:

> head(eventtype, 10)
      id    event_type
1   6597 event_type 11
2   8011 event_type 15
3   2597 event_type 15
4   5022 event_type 15
5   5022 event_type 11
6   6852 event_type 11
7   6852 event_type 15
8   5611 event_type 15
9  14838 event_type 15
10 14838 event_type 11

I want to convert it into a wide format with one indicator column per event type:

   id event_type 1 event_type 2 event_type 3 ... event_type 50 
14838            0            0            0 ...             0 

What is the best way to achieve this in R? Is there any package? I have tried using dummies:

new_my_data <- dummy.data.frame(eventtype, names = c("event_type1", "event_type2", "event_type3", "event_type4", "event_type5"))

but it doesn't work. I searched as well but could not find a solution to this specific problem. Nearly all posts assume the reader already knows how one-hot encoding works.

Please help.

Roman
Bak_was
  • Read this: http://stackoverflow.com/questions/11952706/generate-a-dummy-variable – PKumar Apr 25 '17 at 04:15
  • There is also a package called caret; you can use its dummyVars function to create dummy variables. https://inclass.kaggle.com/c/15-071x-the-analytics-edge-summer-2015/forums/t/15494/dummy-variable-creation-over-categorical-variable – PKumar Apr 25 '17 at 04:18
  • `library(tidyverse); df %>% mutate(i = 1) %>% spread(event_type, i, fill = 0)` – alistaire Apr 25 '17 at 04:18
  • Is this useful: http://stackoverflow.com/questions/5890584/how-to-reshape-data-from-long-to-wide-format ? – jogo Apr 25 '17 at 06:21
  • @alistaire thanks, it does the job :) But did you mean library(tidyr) and library(dplyr) instead of tidyverse? – Bak_was Apr 25 '17 at 07:25
  • `library(tidyverse)` loads both dplyr and tidyr, plus a few others (tibble, readr, ggplot2, purrr). If you like loading them separately, that's fine, though. – alistaire Apr 25 '17 at 07:31
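
The one-liner from the comments can be spelled out as a runnable sketch. This version makes two assumptions that are not in the thread: it uses `pivot_wider()`, the newer tidyr replacement for `spread()`, and it calls `distinct()` first so that a repeated id/event_type pair does not cause trouble when reshaping. The sample data is copied from the `head()` output in the question.

```r
library(dplyr)
library(tidyr)

# Sample rows copied from the question's head(eventtype, 10) output
eventtype <- data.frame(
  id = c(6597, 8011, 2597, 5022, 5022, 6852, 6852, 5611, 14838, 14838),
  event_type = paste("event_type", c(11, 15, 15, 15, 11, 11, 15, 15, 15, 11))
)

wide <- eventtype %>%
  distinct(id, event_type) %>%   # assumption: drop repeated id/type pairs
  mutate(i = 1) %>%              # marker value that becomes the 0/1 cell
  pivot_wider(names_from = event_type, values_from = i, values_fill = 0)
```

Each id ends up on a single row, with a 1 in every event type column it was seen with and a 0 elsewhere.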

1 Answer

Easy with mltools and data.table:

> result
       id event_type_event_type 10 event_type_event_type 11 event_type_event_type 12 event_type_event_type 13 event_type_event_type 14
  1: 1274                        0                        0                        0                        0                        0
  2: 7668                        0                        0                        0                        0                        1
  3:  545                        1                        0                        0                        0                        0
  4: 5614                        0                        0                        0                        0                        0
  5: 9376                        0                        0                        0                        0                        0

Code

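library(data.table)
library(mltools)

set.seed(1701)
df <- data.frame(id = sample(1:10000, 500, replace = TRUE),
                 # one_hot() only encodes unordered factor columns, so build a factor explicitly
                 event_type = factor(paste("event_type", sample(10:20, 500, replace = TRUE))))
dt <- as.data.table(df)
result <- one_hot(dt)  # adds one 0/1 column per event_type level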
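
Note that the result above keeps one row per input row, so an id with several event types still appears several times. If one row per id is wanted, as in the question's target layout, a base R cross-tabulation is one possible sketch. It is not part of the original answer, and the sample data is copied from the question's `head()` output.

```r
# Sample rows copied from the question's head(eventtype, 10) output
eventtype <- data.frame(
  id = c(6597, 8011, 2597, 5022, 5022, 6852, 6852, 5611, 14838, 14838),
  event_type = paste("event_type", c(11, 15, 15, 15, 11, 11, 15, 15, 15, 11))
)

# Cross-tabulate: one row per id, one column per event type, cells are counts
counts <- table(eventtype$id, eventtype$event_type)
indicators <- as.data.frame.matrix(counts)

# Collapse counts to 0/1 so a repeated id/type pair still yields 1
indicators[] <- lapply(indicators, function(x) as.integer(x > 0))

# Move the ids from the row names into a regular column
indicators <- cbind(id = as.numeric(rownames(indicators)), indicators)
rownames(indicators) <- NULL
```

Only event types that occur in the data get a column here. To get columns for types that never occur, event_type would have to be a factor with all of its levels declared before calling `table()`.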
Roman