I am trying to clean a data set and create 3 indicator variables named Adventure, Action and Comedy. The raw data set has 3000 observations (imported as a data frame named dat). I am showing only a few observations:

id    Runtime        Genres                                       
37      75       animation, adventure, family, fantasy, musical   
1       162      action, adventure, fantasy, sci_fi       
95      126      action, fantasy   
100     101      comedy, drama, fantasy   
82      136      action, adventure, sci-fi    
99      117      animation, adventure, comedy, family, sport   
91      95       animation, comedy, crime, family

After importing the dataset into R, I separated Genres into 5 columns using the following R code:

dat1 <- dat %>% separate (Genres, c("Genres1","Genres2" ,"Genres3" ,"Genres4" ,"Genres5" ), sep=",", extra = "drop", fill = "right")


id    Runtime    Genres1    Genres2    Genres3  Genres4  Genres5                                       
37      75       animation  adventure  family   fantasy  musical   
1       162      action     adventure  fantasy  sci_fi       
95      126      action     fantasy   
100     101      comedy     drama      fantasy   
82      136      action     adventure  sci-fi    
99      117      animation  adventure  comedy   family   sport   
91      95       animation  comedy     crime    family

How do I collapse all the genre columns into one indicator column each for action, adventure, and comedy?

I tried using the following code:

I created an empty column for adventure using

dat1 ["adventure"] <- NA

dat1$adventure <- ifelse(dat1$Genres1=="adventure",1,(ifelse(dat1$Genres2=="adventure",1,0))) 

After a suggestion, I shortened the code to

  dat1$adventure <- ifelse((dat1$Genres1=="adventure" | dat1$Genres2=="adventure" | dat1$Genres3=="adventure" | dat1$Genres4=="adventure" ),1, 0)


id    Runtime    Genres1    Genres2    Genres3  Genres4  Genres5  Adventure                                     
37      75       animation  adventure  family   fantasy  musical  0
1       162      action     adventure  fantasy  sci_fi            0
95      126      action     fantasy                               0
100     101      comedy     drama      fantasy                    0
82      136      action     adventure  sci-fi                     0
99      117      animation  adventure  comedy   family   sport    0   
91      95       animation  comedy     crime    family            0

The code was able to extract adventure for Genres1 but returned zero for Genres2.

I have re-edited the question. I tried the things suggested, but I am not sure how to go about it, as there are 3000 observations.

After trying the suggestion:

I listed the genres in a vector and assigned it to dat2:

dat2 <- c( "adventure", "comedy", "action", "drama", "animation", "fantasy", "mystery", "family", "sci-fi", "thriller", "romance", "horror", "musical","history", "war", "documentary", "biography")

table(factor( dat2 ))

 action   adventure   animation   biography      comedy documentary          drama 
      1           1           1           1           1           1           1 
 family     fantasy     history      horror     musical     mystery     romance 
      1           1           1           1           1           1           1 
 sci-fi    thriller         war 
      1           1           1                                                                   

creating the function

 fun1 <- function("adventure", "comedy", "action", "drama", "animation",
"fantasy", "mystery", "family", "sci-fi", "thriller", "romance", "horror", 
"musical","history", "war", "documentary", "biography")) {
 vector_of_cur_genres <- seperate(i, sep = ", ")
 result <- table(factor(vector_of_cur_genres, dat2))
 return(result)
 }  

  # Results         

 fun1 <- function("adventure", "comedy", "action", "drama", 
 "animation", "fantasy", "mystery", "family", "sci-fi", "thriller",  
 "romance", "horror", "musical","history", "war", "documentary", 
 "biography")) {
  Error: unexpected string constant in "fun1 <- function("adventure""
  >   vector_of_cur_genres <- separate(i, sep = ", ")
  Error: Please supply column name
  >   result <- table(factor(vector_of_cur_genres, dat2))
  Error in factor(vector_of_cur_genres, dat2) : 
  object 'vector_of_cur_genres' not found
  >   return(result)
  Error: no function to return from, jumping to top level
   > }
   Error: unexpected '}' in "}"

  mat <- mapply(fun1,dat2$Genres)
       Error in match.fun(FUN) : object 'fun1' not found                                                                                                                                                                                                        
  • FYI, there’s no need to create an empty new column before assigning to it: the assignment creates it anyway. – Konrad Rudolph Jul 26 '16 at 13:45
  • Welcome to Stack Overflow! [How to make a great R reproducible example?](http://stackoverflow.com/questions/5963269) – zx8754 Jul 26 '16 at 13:47
  • Possibly, convert your data from wide to long, then table summary. – zx8754 Jul 26 '16 at 13:48
  • See also: [Split comma-separated column into separate rows](http://stackoverflow.com/questions/13773770/split-comma-separated-column-into-separate-rows) – Jaap Jul 26 '16 at 13:56
  • As a simplification, this can be shortened into a single `ifelse` function: `ifelse((dat1$Genres1=="adventure" | dat1$Genres2=="adventure"),1, 0)` – lmo Jul 26 '16 at 13:57
  • The code returns output for Genres1 as 1 but for other Genres(2-5) returns NA. – Suchit Kumbhare Jul 26 '16 at 14:09
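The 0 and NA results reported above have a likely explanation that can be reproduced on toy data. This is a sketch under two assumptions: that splitting on `","` left a leading space on every genre after the first (so `" adventure"` never equals `"adventure"`), and that `fill = "right"` padded short rows with `NA`. The toy `dat1` and the `genre_cols` helper below are made up for illustration and are not the real data.

```r
# Toy data mimicking dat1 after separate(..., sep = ","): note the leading
# spaces and the NA padding. These values are assumed, not the real data.
dat1 <- data.frame(
  id      = c(37, 95, 100),
  Genres1 = c("animation", "action", "comedy"),
  Genres2 = c(" adventure", " fantasy", " drama"),
  Genres3 = c(" family", NA, " fantasy"),
  stringsAsFactors = FALSE
)

# " adventure" is not equal to "adventure", and NA propagates through `|`,
# so this yields 0 or NA even for a row that does contain adventure
naive <- ifelse(dat1$Genres1 == "adventure" | dat1$Genres2 == "adventure" |
                  dat1$Genres3 == "adventure", 1, 0)

# Trim whitespace, then test each row with %in%, which treats NA as "no match"
genre_cols <- c("Genres1", "Genres2", "Genres3")
dat1[genre_cols] <- lapply(dat1[genre_cols], trimws)
dat1$adventure <- as.integer(apply(dat1[genre_cols], 1,
                                   function(row) "adventure" %in% row))
```

Splitting on `", "` (comma plus space) in the original `separate` call would avoid the leading spaces in the first place, if the raw strings are consistently formatted that way.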

1 Answer

You can use a mixture of table and factor to get what you want. First you want to make sure that all of the genres are spelled exactly the same each time ("Adventure" != "adventure"). Then you should create a vector with all of the possible genres: c("Adventure", "Comedy", "Drama", ...).
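A minimal sketch of that clean-up step, using base R only. The sample in the question does contain both `sci_fi` and `sci-fi`; the capitalized second string is a made-up example, and `genres_raw` and `genres_clean` are hypothetical names.

```r
# Example strings in the style of the question's Genres column;
# the second one is invented to show a capitalization mismatch
genres_raw <- c("action, adventure, fantasy, sci_fi",
                "Action, Adventure, Sci-Fi")

# Lower-case everything and map the "sci_fi" variant onto "sci-fi"
genres_clean <- gsub("sci_fi", "sci-fi", tolower(genres_raw), fixed = TRUE)

# Split on commas, swallowing surrounding spaces, and list the distinct genres
list_of_possible_genres <- sort(unique(unlist(strsplit(genres_clean, "\\s*,\\s*"))))
```

Building the vector from the data this way is one option; typing it by hand, as suggested above, works just as well when the set of genres is known.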

For each row you then call table(factor(genres, list_of_possible_genres)) and it will return a table of counts. You can then construct a matrix with something like this

mat <- mapply(
    function(i) {
        table(factor(separate(i, ...),list_of_possible_genres))
    },df$Genres)
#you want to use the original Data.Frame after import

new.df <- cbind(df,mat) #they should both have the same number of rows here

Make the ... in the separate call the same as in your original call. If you have any questions about what any of the individual functions or steps do, I can explain in the comments.

I define a function right in the mapply call: function(i) ... This is similar to defining a lambda in Python. That function takes in a string of genres and returns a named vector of counts of how many times each possible genre appeared.
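For reference, here is a self-contained sketch of the whole idea on three rows copied from the question. It swaps in base R's `strsplit` for the `separate` call, since `strsplit` works on a single string, and it transposes the `mapply` result because `mapply` returns one column per input element. The names `wanted` and `count_genres` are hypothetical.

```r
# Three rows copied from the question's sample; the real data has ~3000
df <- data.frame(
  id      = c(37, 95, 100),
  Runtime = c(75, 126, 101),
  Genres  = c("animation, adventure, family, fantasy, musical",
              "action, fantasy",
              "comedy, drama, fantasy"),
  stringsAsFactors = FALSE
)

# Only the three genres the question asks about
wanted <- c("adventure", "action", "comedy")

count_genres <- function(string_of_genres) {
  # Split one comma-separated string into a character vector of genres
  parts <- strsplit(string_of_genres, "\\s*,\\s*")[[1]]
  # Genres not listed in `wanted` become NA and are left out of the table
  table(factor(parts, levels = wanted))
}

# mapply returns one column per movie, so transpose to get one row per movie
mat <- t(mapply(count_genres, df$Genres, USE.NAMES = FALSE))
new.df <- cbind(df, mat)
```

With this toy input, `new.df` gains the columns `adventure`, `action` and `comedy`, each holding 0 or 1 per movie.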

EDIT:

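fun1 <- function(string_of_genres) {
    # strsplit() splits a single string; tidyr's separate() expects a data frame
    vector_of_cur_genres <- strsplit(as.character(string_of_genres), ",\\s*")[[1]]
    result <- table(factor(vector_of_cur_genres, list_of_possible_genres))
    return(result)
}
# mat has one column per element of df$Genres; use t(mat) before cbind(df, ...)
mat <- mapply(fun1, df$Genres)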
Adam
  • @Adam: I am a beginner in R. Do you want to work on the raw imported data frame for these steps? Could you please explain the matrix function and cbind? – Suchit Kumbhare Jul 26 '16 at 15:19
  • `cbind` is the easy one. It takes a bunch of matrices or data.frames and attaches their columns to each other. So in the call `cbind(df,mat)`, the data.frame will have the columns of the matrix tagged on to the end. `mapply` is a vectorization function, which means it takes a vector, matrix or list, applies the given function to each element, and gives back the results from each of the function calls. `mapply` is part of the `*apply` family of functions, and there is a lot of literature explaining their nuances and differences. – Adam Jul 26 '16 at 16:01
  • Check my edits. You do want to call this on the data.frame from the first step, before you split the data. If you take a look, you will see that the splitting happens elsewhere in the code. – Adam Jul 26 '16 at 16:05
  • That's like a `lambda` call. One of the parameters to mapply is supposed to be a function, but instead of writing, naming, and passing that function, I skip the naming step and write the function inside of the mapply call. `function (i)` and everything until the comma after `list_of_possible_genres` is the new function that was defined to be used by mapply. That function will be called on each element in the list that was passed to it as a second argument. – Adam Jul 26 '16 at 16:22
  • See my edit separating out the function naming. Hopefully that makes more sense. – Adam Jul 26 '16 at 16:25
  • Thank you for guiding me. This is so complicated and above my knowledge. I was able to create vectors but where do you insert raw data in whole process when you create fun1? – Suchit Kumbhare Jul 26 '16 at 17:16
  • In `mat <- mapply(fun1, dat$Genres)` The second argument should be the **unseparated** Genres. The ones that you have in the dat structure right after import. fun1 is just a function that you call ~3000 times when you apply it to each element of dat$Genres – Adam Jul 26 '16 at 17:35
  • after defining fun1, try running `mat <- mapply(fun1, dat$Genres)` and see what it spits out. Put the call and results into your questions if it doesn't work and we can go from there – Adam Jul 26 '16 at 17:37
  • It did not work, will let everyone know if things work – Suchit Kumbhare Jul 28 '16 at 18:57