I have the Simpsons data from kaggle.com, which includes the title of each episode. I want to count how many times each character's name is used in the titles. I can find exact words in the titles, but my code misses variants such as "Homers" when I look for "Homer". Is there a way to do this?

Data example and my code:

text <- 'title
Homer\'s Night Out
Krusty Gets Busted
Bart Gets an "F"
Two Cars in Every Garage and Three Eyes on Every Fish
Dead Putting Society
Bart the Daredevil
Bart Gets Hit by a Car
Homer vs. Lisa and the 8th Commandment
Oh Brother, Where Art Thou?
Old Money
Lisa\'s Substitute
Blood Feud
Mr. Lisa Goes to Washington
Bart the Murderer
Like Father, Like Clown
Saturdays of Thunder
Burns Verkaufen der Kraftwerk
Radio Bart
Bart the Lover
Separate Vocations
Colonel Homer'

simpsons <- read.csv(text = text, stringsAsFactors = FALSE)

library(stringr)

titlewords <- paste(simpsons$title, collapse = " " )
words <- c('Homer')
titlewords <- gsub("[[:punct:]]", "", titlewords)
HomerCount <- str_count(titlewords, paste(words, collapse=" "))
HomerCount
rawr
Tugrul Uzel
  • Possible duplicate of [Selecting rows where a column has a string like 'hsa..' (partial string match)](http://stackoverflow.com/questions/13043928/selecting-rows-where-a-column-has-a-string-like-hsa-partial-string-match) – Sam Firke Nov 13 '16 at 20:56
  • Don't you just want `sum(grepl('Homer', simpsons$title))`? – rawr Nov 13 '16 at 20:58
  • And `sapply(gregexpr("Homer", simpsons$title), function(x) sum(x > 0))` for the count per string. – Rich Scriven Nov 13 '16 at 21:00
  • Thanks for the help, problem solved now! Great help – Tugrul Uzel Nov 13 '16 at 21:04
  • Is it possible to get in which string Homer is used? Rich's answer gives me a table with 1 and 0's but as I have 600 lines I don't know which lines they are in the list. I don't know if it is possible to get but that would be great if possible! – Tugrul Uzel Nov 13 '16 at 21:18

1 Answer

As an alternative to the excellent suggestions in the comments, you can also use the tidytext package:

library(tidytext)
library(dplyr)

text <- 'title
Homer\'s Night Out
Krusty Gets Busted
Bart Gets an "F"
Two Cars in Every Garage and Three Eyes on Every Fish
Dead Putting Society
Bart the Daredevil
Bart Gets Hit by a Car
Homer vs. Lisa and the 8th Commandment
Oh Brother, Where Art Thou?
Old Money
Lisa\'s Substitute
Blood Feud
Mr. Lisa Goes to Washington
Bart the Murderer
Like Father, Like Clown
Saturdays of Thunder
Burns Verkaufen der Kraftwerk
Radio Bart
Bart the Lover
Separate Vocations
Colonel Homer'

simpsons <- read.csv(text = text, stringsAsFactors = FALSE)

# Number of homers
simpsons %>%
  unnest_tokens(word, title) %>% 
  summarize(count = sum(grepl("homer", word)))

# Title line numbers of homers
simpsons %>% 
  # Record each title's row number before tokenizing so it survives unnesting
  mutate(line = row_number()) %>% 
  unnest_tokens(word, title) %>% 
  filter(grepl("homer", word)) 
Jake Kaupp
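
For the follow-up in the comments (which titles contain the name), here is a minimal base R sketch that needs no extra packages. It uses only a handful of the titles from the question as stand-in data, and the `characters` vector is a hypothetical list of names chosen for illustration. Because `grepl()` and `grep()` match substrings, "Homer" also matches "Homer's".

```r
# A few titles from the question, used as stand-in data
titles <- c("Homer's Night Out", "Krusty Gets Busted", "Bart Gets an \"F\"",
            "Homer vs. Lisa and the 8th Commandment", "Lisa's Substitute",
            "Colonel Homer")

# Hypothetical list of names to look for (an assumption, not from the data set)
characters <- c("Homer", "Bart", "Lisa")

# Number of titles mentioning each name; fixed = TRUE treats the name as a
# literal substring, so "Homer" also matches "Homer's"
title_counts <- sapply(characters,
                       function(nm) sum(grepl(nm, titles, fixed = TRUE)))

# Row numbers of the titles that mention each name
title_rows <- lapply(setNames(characters, characters),
                     function(nm) grep(nm, titles, fixed = TRUE))

title_counts                # one count per name
title_rows$Homer            # positions of the matching titles
titles[title_rows$Homer]    # the matching titles themselves
```

With the full data set, `simpsons$title` would take the place of `titles`. Note that this counts titles containing a name, not repeated occurrences within one title; for the latter, the `gregexpr()` approach from the comments is the better fit.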