Convert multiple value fields to factor

Question

Reading input from a CSV file leaves me with an odd field containing multiple values, e.g.:

  Title                Genres
1     A [Item1, Item2, Item3]
2     B                      
3     C        [Item4, Item1]


df <- data.frame(c("A","B","C"), c("[Item1, Item2, Item3]","","[Item4, Item1]"), 
           stringsAsFactors = FALSE)
colnames(df) <- c("Title","Genres")

A function to retrieve the individual tokens

extractGenre <- function(genreVector){
  strsplit(substring(genreVector,2,nchar(genreVector)-1),", ")
}

I am a bit lost on how to convert Item1, ..., Item4 into factors and append them to the data frame. While apply lets me execute the function on each row, what would the next step look like?

Perhaps, this Q is helpful [Split comma-separated strings in a column into separate rows](https://stackoverflow.com/q/13773770/3817004)? — Uwe, Aug 09 '18 at 17:10

score 1 · Answer 1 · answered Aug 09 '18 at 17:12

I'm not sure if this is exactly what you are looking for, but I approached it a bit differently. I used dplyr and grepl:

    library(dplyr)

    df <- data.frame(c("A","B","C"), c("[Item1, Item2, Item3]","","[Item4, Item1]"), 
                     stringsAsFactors = FALSE)
    colnames(df) <- c("Title","Genres")
    df
    # grepl() already returns TRUE/FALSE, so no ifelse() wrapper is needed
    df1 <- df %>%
      mutate(Item1 = grepl("Item1", Genres),
             Item2 = grepl("Item2", Genres),
             Item3 = grepl("Item3", Genres),
             Item4 = grepl("Item4", Genres))
    df1

      Title                Genres Item1 Item2 Item3 Item4
    1     A [Item1, Item2, Item3]  TRUE  TRUE  TRUE FALSE
    2     B                       FALSE FALSE FALSE FALSE
    3     C        [Item4, Item1]  TRUE FALSE FALSE  TRUE

Hopefully this helps.

score 1 · Accepted Answer · answered Aug 09 '18 at 17:13

library(dplyr)
library(tidyr)

df %>% mutate(Genres = gsub('\\[|\\]|\\s+', '', Genres)) %>%   # remove brackets and whitespace
       separate(Genres, paste0('Gen', 1:3)) %>%                 # split into up to 3 columns; missing pieces become NA
       gather(key, Genres, -Title) %>% select(-key) %>%         # gather back into a single Genres column
       filter(!is.na(Genres)) %>% arrange(Title, Genres) %>%    # drop the NA padding and sort
       mutate(Genres = as.factor(Genres))                       # convert to factor


   Title Genres
1     A  Item1
2     A  Item2
3     A  Item3
4     B       
5     C  Item1
6     C  Item4
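
For comparison, here is a base R sketch of the same long format that does not fix the number of genre columns at three. It is only an illustration: the names `tokens` and `long` are made up here, and it assumes the tokens are always separated by a comma followed by a single space, as in the sample data. Unlike the output above, a title with an empty Genres string (B) drops out of the result entirely.

```r
df <- data.frame(Title  = c("A", "B", "C"),
                 Genres = c("[Item1, Item2, Item3]", "", "[Item4, Item1]"),
                 stringsAsFactors = FALSE)

# Strip the square brackets, then split each string on ", ".
# An empty string yields character(0), i.e. zero tokens for that row.
tokens <- strsplit(gsub("\\[|\\]", "", df$Genres), ", ", fixed = TRUE)

# Repeat each title once per token and stack the tokens into one factor column.
long <- data.frame(Title  = rep(df$Title, lengths(tokens)),
                   Genres = factor(unlist(tokens)),
                   stringsAsFactors = FALSE)
long
```

Because `rep()` takes the per-row token counts from `lengths(tokens)`, this should work for any number of genres per title.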

score 0 · Answer 3 · answered Aug 09 '18 at 17:13

You can use the function separate() as Uwe proposed, but it seems that the order of your genres is not always the same. One option is to create new columns with mutate() and use the function grepl() to identify whether each token is present.

df %>% 
    mutate(
        Item1 = grepl('Item1', Genres),
        Item2 = grepl('Item2', Genres),
        Item3 = grepl('Item3', Genres),
        Item4 = grepl('Item4', Genres)
    )

#   Title                Genres Item1 Item2 Item3 Item4
# 1     A [Item1, Item2, Item3]  TRUE  TRUE  TRUE FALSE
# 2     B                       FALSE FALSE FALSE FALSE
# 3     C        [Item4, Item1]  TRUE FALSE FALSE  TRUE
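
If the set of tokens is not known in advance, the same indicator columns could be built without naming each item by hand. The following is a rough base R sketch, not a drop-in replacement: `tokens`, `items`, `flags` and `result` are hypothetical names, and it assumes the ", " separator seen in the sample data.

```r
df <- data.frame(Title  = c("A", "B", "C"),
                 Genres = c("[Item1, Item2, Item3]", "", "[Item4, Item1]"),
                 stringsAsFactors = FALSE)

# Strip the brackets and split each row into its tokens.
tokens <- strsplit(gsub("\\[|\\]", "", df$Genres), ", ", fixed = TRUE)

# All distinct tokens found in the data, in sorted order.
items <- sort(unique(unlist(tokens)))

# One logical row per title: is each known item among that row's tokens?
flags <- do.call(rbind, lapply(tokens, function(x) items %in% x))
colnames(flags) <- items

# Append the indicator columns to the original data frame.
result <- cbind(df, flags)
result
```

Since `%in%` compares whole tokens, this also sidesteps a pitfall of pattern matching: `grepl('Item1', Genres)` would return TRUE for a token such as `Item10` as well.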
