I have a dataset with a column "genre" that often has multiple genres split by "|". For example:

   Movie Genre 
    M1   Comedy|Drama
    M2   Romance|Drama|Sci-fi

I would like to split these genres into binary indicator columns, one per genre, like so:

   Movie Comedy Drama Romance Sci-fi
    M1     1     1      0      0  
    M2     0     1      1      1
Sam P
  • What language do you want this done in? – user1357015 Jul 17 '17 at 01:13
  • This is a two step process. First [split the column by |](https://stackoverflow.com/questions/7069076/split-column-at-delimiter-in-data-frame) and then [convert it into a boolean matrix](https://stackoverflow.com/questions/22566592/convert-a-dataframe-to-presence-absence-matrix). – Ronak Shah Jul 17 '17 at 01:15
  • I would like it done in R – Sam P Jul 17 '17 at 01:16

2 Answers


You can split the Genre column using strsplit, but "|" is a regular-expression metacharacter, so it must be escaped as "\\|" in the split pattern. For example:

dat <- data.frame(Movie = c("M1", "M2"), 
                  Genre = c("Comedy|Drama", "Romance|Drama|Sci-fi"), 
                  stringsAsFactors = FALSE)
genre_list <- strsplit(dat$Genre, split = "\\|")
unique_genres <- unique(unlist(genre_list, use.names = FALSE))
binary_genres <- t(sapply(genre_list, function(e) unique_genres %in% e))
mode(binary_genres) <- "integer"
colnames(binary_genres) <- unique_genres
out <- cbind(dat[1], binary_genres)
out

This gives the result as a data frame with one binary (0/1) indicator column per genre:

Movie Comedy Drama Romance Sci-fi
M1      1     1       0      0
M2      0     1       1      1
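
To illustrate the escaping point above, here is a minimal sketch (the variable names x, naive, escaped and literal are made up for this example). It compares an unescaped "|" with the escaped pattern, and with fixed = TRUE, which is another way to match the pipe literally:

```r
x <- "Comedy|Drama"

# Unescaped "|" is regex alternation between two empty patterns,
# so the string is broken into single characters.
naive <- strsplit(x, split = "|")[[1]]

# Two ways to split on a literal pipe character.
escaped <- strsplit(x, split = "\\|")[[1]]
literal <- strsplit(x, split = "|", fixed = TRUE)[[1]]

# Both literal-pipe variants give the same pieces.
identical(escaped, literal)
```

Either literal form should work in the answer's strsplit call; fixed = TRUE avoids having to remember which characters need escaping.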
Shaun Wilkinson

Alternatively, use separate_rows() to split Genre on "|" and spread() to widen the data frame. Both come from the tidyr package; mutate() and %>% come from dplyr:

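library(dplyr)  # provides %>% and mutate()
library(tidyr)  # provides separate_rows() and spread()

# df is a data frame with Movie and Genre columns
df %>% 
  separate_rows(Genre, sep = "[|]") %>%
  mutate(Value = 1) %>%
  spread(Genre, Value, fill = 0)  # fill = 0 marks genres a movie lacks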

This gives:

  Movie Comedy Drama Romance Sci-fi
1    M1      1     1       0      0
2    M2      0     1       1      1
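
In current tidyr, spread() is superseded by pivot_wider(). The following is a rough sketch of the same reshaping with that function, assuming tidyr 1.1.0 or later. The example data frame dat and the helper column Value are illustrative choices, not part of the original question:

```r
library(tidyr)

# Example data matching the question (assumed for illustration).
dat <- data.frame(Movie = c("M1", "M2"),
                  Genre = c("Comedy|Drama", "Romance|Drama|Sci-fi"),
                  stringsAsFactors = FALSE)

# Marker column: 1 means "this movie has this genre".
dat$Value <- 1L

# One row per movie/genre pair.
long <- separate_rows(dat, Genre, sep = "[|]")

# One column per genre; missing combinations are filled with 0.
wide <- pivot_wider(long, names_from = Genre, values_from = Value,
                    values_fill = 0L)
wide
```

Because values_fill supplies the zeros during the reshape, no separate step is needed to replace NA values afterwards.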
HNSKD