This is my dataset: when I select the Actors column, I get a list of character vectors (4 actors per movie)
head(movies$Actors)
[[1]]
[1] "Rishab Shetty"     " Sapthami Gowda"   " Kishore Kumar G."
[4] " Achyuth Kumar"

[[2]]
[1] "Christian Bale" " Heath Ledger"  " Aaron Eckhart" " Michael Caine"

[[3]]
[1] "Elijah Wood"      " Viggo Mortensen" " Ian McKellen"
[4] " Orlando Bloom"

[[4]]
[1] "Leonardo DiCaprio"     " Joseph Gordon-Levitt" " Elliot Page"
[4] " Ken Watanabe"

[[5]]
[1] "Elijah Wood"      " Ian McKellen"    " Viggo Mortensen"
[4] " Orlando Bloom"

[[6]]
[1] "Elijah Wood"    " Ian McKellen"  " Orlando Bloom" " Sean Bean"
Since there are 5000 rows, there are far too many distinct actors for one-hot encoding. Instead, I tried to find the top 20 actors by number of movies (using sort() and table()), and then add a binary variable that indicates whether a movie features any of those top actors, as this might be a simple proxy for whether the movie has good ratings.
Unfortunately, the code doesn't work. Can't seem to google my way out of this either. Can anyone help me?
## get 20 biggest actors in terms of number of movies
top20actorstable <- sort(table(actorlist), decreasing = T)[1:20]
names(top20actorstable)
## one hot encoding
top20actorsnames <- names(top20actorstable)
movies$bigactor <- NA
for (i in nrow(movies)){
  listactors <- unlist(movies[i,]$Actors)
  if (any(is.element(listactors, top20actorsnames))){
    movies[i,]$bigactor <- 1
  } else {
    movies[i,]$bigactor <- 0
  }
}
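The counting step relies on actorlist, which is not shown above. As a minimal sketch, assuming actorlist is meant to be one flat character vector of every name in the Actors column, it could be built as below. The sample data is the first six movies printed earlier; the names actors, n_top, actor_counts and top_actors are illustrative only. Several names carry a leading space, so trimws() is applied in case the same actor appears both with and without it.

```r
# Sample data: the first six movies shown in the question.
actors <- list(
  c("Rishab Shetty", " Sapthami Gowda", " Kishore Kumar G.", " Achyuth Kumar"),
  c("Christian Bale", " Heath Ledger", " Aaron Eckhart", " Michael Caine"),
  c("Elijah Wood", " Viggo Mortensen", " Ian McKellen", " Orlando Bloom"),
  c("Leonardo DiCaprio", " Joseph Gordon-Levitt", " Elliot Page", " Ken Watanabe"),
  c("Elijah Wood", " Ian McKellen", " Viggo Mortensen", " Orlando Bloom"),
  c("Elijah Wood", " Ian McKellen", " Orlando Bloom", " Sean Bean")
)

# Assumption: actorlist is a flat character vector of all actor names.
# trimws() strips the leading spaces so each actor is counted under one name.
actorlist <- trimws(unlist(actors))

# Count appearances, sort in decreasing order, keep the top n_top names
# (3 for this small sample; the question uses 20).
n_top <- 3
actor_counts <- sort(table(actorlist), decreasing = TRUE)
top_actors <- names(head(actor_counts, n_top))
```

If the names are trimmed here, the same trimming would presumably be needed when comparing each movie's actors against the top names later on.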
Edit:
> dput(head(movies$Actors, 10))
list(c("Rishab Shetty", " Sapthami Gowda", " Kishore Kumar G.",
" Achyuth Kumar"), c("Christian Bale", " Heath Ledger", " Aaron Eckhart",
" Michael Caine"), c("Elijah Wood", " Viggo Mortensen", " Ian McKellen",
" Orlando Bloom"), c("Leonardo DiCaprio", " Joseph Gordon-Levitt",
" Elliot Page", " Ken Watanabe"), c("Elijah Wood", " Ian McKellen",
" Viggo Mortensen", " Orlando Bloom"), c("Elijah Wood", " Ian McKellen",
" Orlando Bloom", " Sean Bean"), c("Keanu Reeves", " Laurence Fishburne",
" Carrie-Anne Moss", " Hugo Weaving"), c("Mark Hamill", " Harrison Ford",
" Carrie Fisher", " Billy Dee Williams"), c("Arnold Schwarzenegger",
" Linda Hamilton", " Edward Furlong", " Robert Patrick"), c("Mark Hamill",
" Harrison Ford", " Carrie Fisher", " Alec Guinness"))
What I meant by "code doesn't work": I was hoping the for loop would go through the rows one by one, unlist each row's actors, and check them against top20actorsnames. If a row contains at least one of the top actors, its bigactor value should be 1, otherwise 0.
However, when I check the column after the for loop, it returns NA:
> for (i in nrow(movies)){
+ listactors <- unlist(movies[i,]$Actors)
+ if (any(is.element(listactors, top20actorsnames))){
+ movies[i,]$bigactor <- 1
+ }
+ else {movies[i,]$bigactor <- 0}
+ }
Warning: provided 11 variables to replace 10 variables
> movies$bigactor
NULL
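For comparison, the intended row-by-row check can be sketched as follows. This is only an illustration under stated assumptions: the three-row movies data frame and the three names in top20actorsnames are hypothetical stand-ins, not the real data. One thing it highlights is that for (i in nrow(movies)) iterates over the single value nrow(movies), so only the last row is visited, whereas seq_len(nrow(movies)) yields every row index. The sketch also assigns into the column vector (movies$bigactor[i]) rather than into a row subset (movies[i,]$bigactor).

```r
# Hypothetical stand-in for the real data: a data frame with a list column.
movies <- data.frame(id = 1:3)
movies$Actors <- list(
  c("Christian Bale", " Heath Ledger", " Aaron Eckhart", " Michael Caine"),
  c("Elijah Wood", " Viggo Mortensen", " Ian McKellen", " Orlando Bloom"),
  c("Keanu Reeves", " Laurence Fishburne", " Carrie-Anne Moss", " Hugo Weaving")
)

# Hypothetical top-actor names (in the question these come from table()).
top20actorsnames <- c("Elijah Wood", "Ian McKellen", "Heath Ledger")

# seq_len(nrow(movies)) gives 1, 2, ..., nrow(movies), so every row is visited.
movies$bigactor <- NA_integer_
for (i in seq_len(nrow(movies))) {
  # Trim the leading spaces before comparing against the top names.
  listactors <- trimws(movies$Actors[[i]])
  movies$bigactor[i] <- as.integer(any(listactors %in% top20actorsnames))
}

# Loop-free equivalent: one 0/1 value per movie.
bigactor_vec <- vapply(
  movies$Actors,
  function(a) as.integer(any(trimws(a) %in% top20actorsnames)),
  integer(1)
)
```

With this sample, the first two movies each contain one of the hypothetical top names and the third contains none.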