This is my dataset: when I select the Actors column, I get a list of character vectors (4 actors per movie):

head(movies$Actors)

[[1]]
[1] "Rishab Shetty"     " Sapthami Gowda"   " Kishore Kumar G."
[4] " Achyuth Kumar"

[[2]]
[1] "Christian Bale" " Heath Ledger"  " Aaron Eckhart" " Michael Caine"

[[3]]
[1] "Elijah Wood"      " Viggo Mortensen" " Ian McKellen"
[4] " Orlando Bloom"

[[4]]
[1] "Leonardo DiCaprio"     " Joseph Gordon-Levitt" " Elliot Page"
[4] " Ken Watanabe"

[[5]]
[1] "Elijah Wood"      " Ian McKellen"    " Viggo Mortensen"
[4] " Orlando Bloom"

[[6]]
[1] "Elijah Wood"    " Ian McKellen"  " Orlando Bloom" " Sean Bean"

Since there are 5000 rows, there are far too many actors for one-hot encoding. What I tried to do is find the top 20 actors (using sort() and table()), and then add a binary variable that states whether a particular movie has any of those top 20 actors in it, as this might be a simple proxy for whether the movie has good ratings.

Unfortunately, the code doesn't work. Can't seem to google my way out of this either. Can anyone help me?

## get 20 biggest actors in terms of number of movies 
top20actorstable <- sort(table(actorlist), decreasing = T)[1:20]
names(top20actorstable)
## one hot encoding 
top20actorsnames <- names(top20actorstable)

movies$bigactor <- NA

for (i in nrow(movies)){
  listactors <- unlist(movies[i,]$Actors)
  if (any(is.element(listactors, top20actorsnames))){
    movies[i,]$bigactor <- 1 
  }
  else {movies[i,]$bigactor <- 0}
}

Edit:

> dput(head(movies$Actors, 10))
list(c("Rishab Shetty", " Sapthami Gowda", " Kishore Kumar G.", 
" Achyuth Kumar"), c("Christian Bale", " Heath Ledger", " Aaron Eckhart", 
" Michael Caine"), c("Elijah Wood", " Viggo Mortensen", " Ian McKellen", 
" Orlando Bloom"), c("Leonardo DiCaprio", " Joseph Gordon-Levitt", 
" Elliot Page", " Ken Watanabe"), c("Elijah Wood", " Ian McKellen", 
" Viggo Mortensen", " Orlando Bloom"), c("Elijah Wood", " Ian McKellen", 
" Orlando Bloom", " Sean Bean"), c("Keanu Reeves", " Laurence Fishburne", 
" Carrie-Anne Moss", " Hugo Weaving"), c("Mark Hamill", " Harrison Ford", 
" Carrie Fisher", " Billy Dee Williams"), c("Arnold Schwarzenegger", 
" Linda Hamilton", " Edward Furlong", " Robert Patrick"), c("Mark Hamill", 
" Harrison Ford", " Carrie Fisher", " Alec Guinness"))

What I meant by "code doesn't work": I was hoping for the for loop to, one by one, check within the list of actors of each row, unlist them and check against the list of top20actors - if there is one of the top actors, then the bigactor column would be a 1, otherwise 0.

However, when I check the column after the for loop, it returns NA:

> for (i in nrow(movies)){
+   listactors <- unlist(movies[i,]$Actors)
+   if (any(is.element(listactors, top20actorsnames))){
+     movies[i,]$bigactor <- 1 
+   }
+   else {movies[i,]$bigactor <- 0}
+ }
Warning: provided 11 variables to replace 10 variables
> movies$bigactor
NULL
jojorabbit
  • It's easier to help you if you include a simple [reproducible example](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) with sample input and desired output that can be used to test and verify possible solutions. – MrFlick Nov 03 '22 at 18:05
  • thank you for your patience as I am not well versed in stackoverflow. have made the edits, hope it helps! – jojorabbit Nov 04 '22 at 10:58

1 Answer

Here is my approach. Make the list of actors of interest.
Then loop through the list of movies (using sapply()) and find the movies containing (%in%) any of the actors of interest. This returns a vector of TRUE/FALSE values, one per movie, marking the matches.

movies <- list(c("Rishab Shetty", " Sapthami Gowda", " Kishore Kumar G.", "Achyuth Kumar"), 
               c("Christian Bale", " Heath Ledger", " Aaron Eckhart", " Michael Caine"), 
               c("Elijah Wood", " Viggo Mortensen", " Ian McKellen",  " Orlando Bloom"), 
               c("Leonardo DiCaprio", " Joseph Gordon-Levitt", " Elliot Page", " Ken Watanabe"), 
               c("Elijah Wood", " Ian McKellen", " Viggo Mortensen", " Orlando Bloom"), 
               c("Elijah Wood", " Ian McKellen",  " Orlando Bloom", " Sean Bean"), 
               c("Keanu Reeves", " Laurence Fishburne",  " Carrie-Anne Moss", " Hugo Weaving"), 
               c("Mark Hamill", " Harrison Ford", " Carrie Fisher", " Billy Dee Williams"), 
               c("Arnold Schwarzenegger",      " Linda Hamilton", " Edward Furlong", " Robert Patrick"), 
               c("Mark Hamill", " Harrison Ford", " Carrie Fisher", " Alec Guinness"))



# build one flat vector of all actor names
# trimws() removes the leading and trailing spaces
actorlist <- unlist(movies) |> trimws()
# names of the most frequent actors
# (shortened to 7 for debugging; use 1:20 on the full data)
topactors <- sort(table(actorlist), decreasing = TRUE)[1:7] |> names()

# loop through the list looking for matching actors
# returns a logical vector, TRUE where a movie features a top actor
bigactor <- sapply(movies, function(movie) {
   any(trimws(movie) %in% topactors)
})
bigactor
# convert TRUE/FALSE to 1/0
as.integer(bigactor)

Since the data sample you provided is a plain list, I am not sure where the final results should be stored. You could try to store your list of vectors in a data frame, but that is complicated and not very helpful.
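If the actors do live in a list column of a data frame, as the question suggests, the result can probably be written straight back as a new column. The sketch below assumes a small made-up `movies` data frame (the `id` column and the four rows are hypothetical, not the real data) and keeps only the top 2 actors so the result is easy to check by hand. It also shows a likely reason the original loop misbehaved: `for (i in nrow(movies))` iterates over a single value, the row count, so only the last row is ever visited. `seq_len(nrow(movies))` visits every row.

```r
# Hypothetical stand-in for the real data: a data frame with a list column
movies <- data.frame(id = 1:4)
movies$Actors <- list(
  c("Elijah Wood", " Ian McKellen", " Orlando Bloom"),
  c("Christian Bale", " Heath Ledger"),
  c("Elijah Wood", " Ian McKellen", " Sean Bean"),
  c("Mark Hamill", " Harrison Ford")
)

# Flatten the list column and trim the stray leading spaces before counting
actorlist <- trimws(unlist(movies$Actors))
# Top 2 here for illustration; use [1:20] on the full data
topactors <- names(sort(table(actorlist), decreasing = TRUE))[1:2]

# Vectorised version: one 0/1 integer per row, stored as a new column
movies$bigactor <- vapply(
  movies$Actors,
  function(actors) as.integer(any(trimws(actors) %in% topactors)),
  integer(1)
)

# Loop version: seq_len() walks over every row index, not just the last one
bigactor_loop <- integer(nrow(movies))
for (i in seq_len(nrow(movies))) {
  actors <- trimws(movies$Actors[[i]])
  bigactor_loop[i] <- as.integer(any(actors %in% topactors))
}

movies$bigactor
```

Both versions should give the same column. `vapply()` is used instead of `sapply()` only because it guarantees an integer vector even when the input is empty; either would do for this data.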

Dave2e