Reading input from a CSV file leaves me with an odd field containing multiple values, e.g.:

  Title                Genres
1     A [Item1, Item2, Item3]
2     B                      
3     C        [Item4, Item1]


df <- data.frame(c("A","B","C"), c("[Item1, Item2, Item3]","","[Item4, Item1]"), 
           stringsAsFactors = FALSE)
colnames(df) <- c("Title","Genres")

Here is a function to retrieve the individual tokens:

extractGenre <- function(genreVector){
  strsplit(substring(genreVector,2,nchar(genreVector)-1),", ")
} 

I am a bit lost on how to convert Item1, ..., Item4 into factors and append them to the data frame. While apply lets me execute the function on each row, what would the next step look like?

Kilian

3 Answers

I'm not sure if this is exactly what you are looking for, but I approached it a bit differently. I used dplyr and grepl:

    library(dplyr)

    df <- data.frame(c("A","B","C"), c("[Item1, Item2, Item3]","","[Item4, Item1]"), 
                     stringsAsFactors = FALSE)
    colnames(df) <- c("Title","Genres")
    df
    # Add one logical column per item: TRUE when the item name occurs in Genres
    df1 <- df %>%
      mutate(Item1 = ifelse(grepl("Item1", Genres), TRUE, FALSE),
             Item2 = ifelse(grepl("Item2", Genres), TRUE, FALSE),
             Item3 = ifelse(grepl("Item3", Genres), TRUE, FALSE),
             Item4 = ifelse(grepl("Item4", Genres), TRUE, FALSE))

  Title                Genres Item1 Item2 Item3 Item4
1     A [Item1, Item2, Item3]  TRUE  TRUE  TRUE FALSE
2     B                       FALSE FALSE FALSE FALSE
3     C        [Item4, Item1]  TRUE FALSE FALSE  TRUE
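
If the item names are not known in advance, the columns do not have to be written out by hand. Here is a minimal base R sketch of that idea, assuming the tokens are always separated by ", " and wrapped in square brackets as in the question. The names tokens, items, flags and df_flags are made up for illustration.

```r
df <- data.frame(Title = c("A", "B", "C"),
                 Genres = c("[Item1, Item2, Item3]", "", "[Item4, Item1]"),
                 stringsAsFactors = FALSE)

# Strip the surrounding brackets, then split each row on ", "
# (an empty string yields a zero-length character vector)
tokens <- strsplit(gsub("\\[|\\]", "", df$Genres), ", ", fixed = TRUE)

# Every distinct item that occurs anywhere in the column
items <- sort(unique(unlist(tokens)))

# One logical column per item: TRUE when the row's token list contains it
flags <- sapply(items, function(item) {
  vapply(tokens, function(x) item %in% x, logical(1))
})

# Append the indicator columns to the original data frame
df_flags <- cbind(df, flags)
```

Because this compares whole tokens with %in% instead of searching the raw string, it should give the same table as above for this data while picking up any new item names automatically.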

Hopefully this helps.

Silentdevildoll
library(dplyr)
library(tidyr)

df %>% mutate(Genres = gsub('\\[|\\]|\\s+', '', Genres)) %>%       # strip brackets and whitespace
       separate(Genres, paste0('Gen', 1:3), fill = 'right') %>%    # split into up to 3 columns, padding short rows with NA
       gather(key, Genres, -Title) %>% select(-key) %>%            # stack the Gen* columns back into a single Genres column
       filter(!is.na(Genres)) %>% arrange(Title, Genres) %>%       # drop the NA padding, then sort
       mutate(Genres = as.factor(Genres))                          # convert the tokens to a factor


   Title Genres
1     A  Item1
2     A  Item2
3     A  Item3
4     B       
5     C  Item1
6     C  Item4              
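
The pipeline above hard-codes a maximum of three items per row via paste0('Gen', 1:3). As a rough alternative, the same long format can be sketched in base R without fixing that number, assuming the ", " separator from the question. The names tokens and long are illustrative only, and unlike the output above, this sketch drops titles that have no genre at all (row B).

```r
df <- data.frame(Title = c("A", "B", "C"),
                 Genres = c("[Item1, Item2, Item3]", "", "[Item4, Item1]"),
                 stringsAsFactors = FALSE)

# Strip the brackets and split each row into its tokens
tokens <- strsplit(gsub("\\[|\\]", "", df$Genres), ", ", fixed = TRUE)

# Repeat each title once per token and flatten the tokens into a factor
long <- data.frame(
  Title  = rep(df$Title, lengths(tokens)),
  Genres = factor(unlist(tokens)),
  stringsAsFactors = FALSE
)
```

Here levels(long$Genres) lists each distinct item once, which may be closer to what the question means by converting the items into factors.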
A. Suliman

You can use the function separate() as Uwe proposed, but it seems that the order of your genres is not always the same. One option is to create new columns with mutate() and use grepl() to identify whether each token is present.

library(dplyr)

df %>% 
    mutate(
        Item1 = grepl('Item1', Genres),
        Item2 = grepl('Item2', Genres),
        Item3 = grepl('Item3', Genres),
        Item4 = grepl('Item4', Genres)
    )

#   Title                Genres Item1 Item2 Item3 Item4
# 1     A [Item1, Item2, Item3]  TRUE  TRUE  TRUE FALSE
# 2     B                       FALSE FALSE FALSE FALSE
# 3     C        [Item4, Item1]  TRUE FALSE FALSE  TRUE
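
One caveat with plain grepl() is that it matches substrings, so a pattern like 'Item1' would also match a longer name that starts the same way. The sketch below illustrates this with a hypothetical 'Item10' value, which does not appear in the question's data, and shows word boundaries as one possible way around it.

```r
# 'Item10' is a made-up value used only to demonstrate the substring issue
genres <- c("[Item1, Item2]", "[Item10]", "")

# Plain substring match: also TRUE for the row that only contains Item10
loose <- grepl("Item1", genres)

# \\b anchors the match at word boundaries, so Item10 no longer matches
strict <- grepl("\\bItem1\\b", genres, perl = TRUE)

print(loose)   # TRUE TRUE FALSE
print(strict)  # TRUE FALSE FALSE
```

Whether this matters depends on the real genre names; if no name is a prefix of another, the simpler patterns above behave the same.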
demarsylvain