4

My dataset looks exactly like this (only with much more observations and attributes): Dataset.xlsx

I want each attribute to have a binary variable with the value 1 if PersonX has it and otherwise 0, but each person must also include the attributes of others. It should look like this, where of course if Person1 has also the same attribute as Person2, the variable shouldn't be generated again :

ID       Class_Label    A469 T593 K022K A835 Z935 U83F W5326
Person1  TRUE           1    1    1     0    0    0    0
Person2  FALSE          0    1    0     1    1    0    0
Person3  FALSE          0    0    1     0    0    1    1

As you can see Person1 and Person3 have the attribute: K022K in common and Person1 and Person2 T593. Is there any way how to solve this?

Community
  • 1
  • 1
Textime
  • 89
  • 8
  • 1
    How many attributes do you have? There are elegant ways of handling any number of these... and then there's the reality of data wrangling for finite datasets where you kind of just make it work. – James Wang Mar 19 '19 at 17:19
  • I have like 100k observations and in total approximately 3000 possible attributes. But each Person has approx. 300 of those 3000 attributes. So it is really difficult for me to count all attributes. – Textime Mar 19 '19 at 17:22
  • Ok, elegant it is—give me a second, I've done something similar before. – James Wang Mar 19 '19 at 17:25

2 Answers2

5
library(tidyverse)

df <- tibble(
  id = paste0("Person", 1:3),
  class_label = c(TRUE, FALSE, FALSE),
  attribute = c("A469/T593/K022K", "A835/Z935/T593", "U835F/W5326/K022K")
)
df
#> # A tibble: 3 x 3
#>   id      class_label attribute        
#>   <chr>   <lgl>       <chr>            
#> 1 Person1 TRUE        A469/T593/K022K  
#> 2 Person2 FALSE       A835/Z935/T593   
#> 3 Person3 FALSE       U835F/W5326/K022K

df %>%
  separate_rows(attribute, sep = "/") %>%
  mutate(i = 1) %>%
  spread(attribute, i, fill = 0)
#> # A tibble: 3 x 9
#>   id      class_label  A469  A835 K022K  T593 U835F W5326  Z935
#>   <chr>   <lgl>       <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 Person1 TRUE            1     0     1     1     0     0     0
#> 2 Person2 FALSE           0     1     0     1     0     0     1
#> 3 Person3 FALSE           0     0     1     0     1     1     0

Taking note that what you are trying to do is frequently referred to as one-hot encoding or binary encoding. In addition, you may want to note that df %>% separate_rows(attribute, sep = "/") your data is in a tidy format which could have other benefits.

UPDATE: To extend to more columns, you might want to first determine which attributes you will be encoding. So something like, select(df, contains("attribute")) or select(df, 3:4).

df <- tibble(
  id = paste0("Person", 1:3),
  class_label = c(TRUE, FALSE, FALSE),
  attribute = c("A469/T593/K022K", "A835/Z935/T593", "U835F/W5326/K022K"),
  attribute2 = c("one/two/three", "four/five/six", "one/five/six")
)
df
#> # A tibble: 3 x 4
#>   id      class_label attribute         attribute2   
#>   <chr>   <lgl>       <chr>             <chr>        
#> 1 Person1 TRUE        A469/T593/K022K   one/two/three
#> 2 Person2 FALSE       A835/Z935/T593    four/five/six
#> 3 Person3 FALSE       U835F/W5326/K022K one/five/six

one_hot <- function(data, att) {
  quo_att <- enquo(att)
  data %>%
    select(id, class_label, !! quo_att) %>% 
    separate_rows(!! quo_att, sep = "/") %>%
    mutate(i = 1) %>%
    spread(!! quo_att, i, fill = 0) %>%
    select(-id, -class_label)
}


attributes_to_map <- select(df, contains("attribute")) %>% names
attributes_to_map
#> [1] "attribute"  "attribute2"

attributes_to_map %>%
  map_dfc(~ one_hot(df, .)) %>%
  bind_cols(select(df, id, class_label)) %>%
  select(id, class_label, everything())
#> # A tibble: 3 x 15
#>   id    class_label  A469  A835 K022K  T593 U835F W5326  Z935  five  four
#>   <chr> <lgl>       <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 Pers~ TRUE            1     0     1     1     0     0     0     0     0
#> 2 Pers~ FALSE           0     1     0     1     0     0     1     1     1
#> 3 Pers~ FALSE           0     0     1     0     1     1     0     1     0
#> # ... with 4 more variables: one <dbl>, six <dbl>, three <dbl>, two <dbl>

But at this point, you may also want to consider the recipes package or do searches on one-hot encoding for several variables.

JasonAizkalns
  • 20,243
  • 8
  • 57
  • 116
  • Thank you Jason, but I have like 3000 attributes, so it would be really time consuming to perform your code. In addition, some persons ID is like "i836d", I should have mentioned that. – Textime Mar 19 '19 at 17:41
  • I found another use for your code. Dude you are a life saver! – Textime Mar 21 '19 at 08:34
2

Ok, using your example as minimal.xlsx:

Minimal example CSV

install.packages('readxl')  # if you don't have this already
library(readxl)

example <- read_excel('./minimal.xlsx')  # assuming file is in working directory
example$Attribute <- as.character(example$Attribute)  # convert to character

attrs <- strsplit(example$Attribute, '/')  # split by /
attrs <- unlist(attrs)  # flatten the list
attrs <- unique(attrs)  # extract uniques

for (attr in attrs) {
  attr_row <- grepl(attr, example$Attribute)  # boolean of True/false
  attr_row <- attr_row * 1  # convert to 0, 1
  example[attr] <- attr_row
}

I tried to explain it in the comments, but in essence:

  • Turn the Attributes into characters and split them on your denoting character
  • Combine them into a "set" vector of unique Attributes
  • Loop through them, and generate each row
  • Append each row back into the DataFrame

The result is here:

Final output

You can delete the original Attribute column afterward too, but that should get you what you want, is a generalized solution, and doesn't require outside libraries.

Edit: The other answer is shorter and definitely use it for quick wrangling on this, personally I often like to use base R if I can for small tasks like this, especially for scripts I want to share with others.

James Wang
  • 1,281
  • 9
  • 17
  • Thanks! I will try that one tomorrow, but this looks great! Does it matter if it is a .xlsx file? – Textime Mar 19 '19 at 17:44
  • No, you should be able to use `read.xlxs`—I'll try it to make sure, but I'm pretty sure it makes no difference. – James Wang Mar 19 '19 at 17:45
  • Ok, I just converted it into Excel format—annoyingly, I couldn't actually get `xlsx` working because of RJava problems, but a less fiddly version is `readxl` anyway as per this: https://stackoverflow.com/questions/7049272/importing-excel-files-into-r-xlsx-or-xls – James Wang Mar 19 '19 at 17:53
  • 1
    Thank you James! This solved my problem and the code is so elegant! I wish I could spend you one coffee :-) – Textime Mar 20 '19 at 13:54
  • Glad to hear it—good luck on the research study/project/etc.! – James Wang Mar 20 '19 at 18:08