
I found a very useful piece of code on Stack Overflow, Finding 2 & 3 word Phrases Using R TM Package (credit @patrick perry), that shows the frequency of 2- and 3-word phrases within a corpus:

library(corpus)
corpus <- gutenberg_corpus(55) # Project Gutenberg #55, _The Wizard of Oz_
text_filter(corpus)$drop_punct <- TRUE # ignore punctuation
term_stats(corpus, ngrams = 2:3)
##    term             count support
## 1  of the             336       1
## 2  the scarecrow      208       1
## 3  to the             185       1
## 4  and the            166       1
## 5  said the           152       1
## 6  in the             147       1
## 7  the lion           141       1
## 8  the tin            123       1
## 9  the tin woodman    114       1
## 10 tin woodman        114       1
## 11 i am                84       1
## 12 it was              69       1
## 13 in a                64       1
## 14 the great           63       1
## 15 the wicked          61       1
## 16 wicked witch        60       1
## 17 at the              59       1
## 18 the little          59       1
## 19 the wicked witch    58       1
## 20 back to             57       1
## ⋮  (52511 rows total)

How do you ensure that the occurrences making up a longer phrase like "the tin woodman" are not also counted towards the frequencies of its sub-phrases "the tin" and "tin woodman"?
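For instance, all 114 counts of "the tin woodman" also contribute to the 123 counts of "the tin", so (assuming simple subtraction is the right adjustment) only 123 - 114 = 9 occurrences of "the tin" stand apart from the longer phrase:

library(corpus)
corpus <- gutenberg_corpus(55)
text_filter(corpus)$drop_punct <- TRUE
stats <- term_stats(corpus, ngrams = 2:3)

# occurrences of "the tin" that are not part of "the tin woodman"
with(stats, count[term == "the tin"] - count[term == "the tin woodman"])
## [1] 9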

Thanks

Mcguns

1 Answer


Removing stop words strips noise from the data that causes issues such as the one you are having above:

library(tm)
library(corpus)
library(dplyr)
library(stringr) # for str_extract()

corpus <- Corpus(VectorSource(gutenberg_corpus(55)))
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeWords, stopwords("english"))

term_stats(corpus, ngrams = 2:3) %>% 
  arrange(desc(count)) %>%
  # group each trigram with the bigram formed by its first two words
  group_by(grp = str_extract(as.character(term), "\\w+\\s+\\w+")) %>% 
  # where counts within a group differ, report max - min, i.e. the
  # occurrences of the shorter phrase outside the longer one;
  # otherwise keep the raw count
  mutate(count_unique = ifelse(length(unique(count)) > 1,
                               max(count) - min(count), count)) %>% 
  ungroup() %>% 
  select(-grp)
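Note the grouping above catches a trigram's leading word pair only; if you also need the trailing pair handled, so that "tin woodman" (114) is reduced by "the tin woodman" (114) as well, a rough sketch of one possible extension (not part of the solution above) is to subtract the counts of the containing trigrams directly:

library(corpus)
library(dplyr)
library(stringr)

stats    <- term_stats(corpus, ngrams = 2:3)
trigrams <- filter(stats, str_count(term, "\\S+") == 3)

stats %>%
  rowwise() %>%
  # total count of the trigrams containing this term as their leading
  # or trailing word pair; beware this can over-subtract when a single
  # occurrence sits inside two overlapping trigrams
  mutate(inside = sum(trigrams$count[
           str_starts(trigrams$term, paste0(term, " ")) |
           str_ends(trigrams$term, paste0(" ", term))]),
         count_unique = count - inside) %>%
  ungroup()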
hello_friend
  • Thanks @hello_friend. What if the issue occurs in phrases without stop words, such as "northern ireland" and "northern ireland counties"? Is there a way to avoid double counting? – Mcguns Jun 06 '20 at 10:02
  • @Mcguns please see the revised solution above – hello_friend Jun 06 '20 at 10:17
  • Ran this on my own data and it works great, thanks very much; the tibble output is what I need. I can't provide the original data in full, but I've tried the code again with ngrams = 1:3 to examine single words too, and it's difficult to interpret the output. – Mcguns Jun 06 '20 at 10:27
  • The regex in the str_extract in the original function won't stand up to ngrams = 1:3; it works for 2:3 (see the sketch after these comments). – hello_friend Jun 06 '20 at 10:59
  • Thanks @hello_friend - I'll keep the single-word and 2-3 word phrase analyses separate and work out counts by comparison to make sure there's no double counting in the frequency analysis – Mcguns Jun 06 '20 at 11:10
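For reference, a minimal sketch that tolerates ngrams = 1:3 in one pass, assuming it is acceptable to key every term on its leading word (word() here is from stringr; the max(count) - min(count) heuristic, and its limitations, carry over from the solution above):

library(dplyr)
library(stringr)

term_stats(corpus, ngrams = 1:3) %>%
  arrange(desc(count)) %>%
  # "tin", "tin woodman" and "tin woodman said" now share one group
  group_by(grp = word(as.character(term), 1)) %>%
  mutate(count_unique = ifelse(length(unique(count)) > 1,
                               max(count) - min(count), count)) %>%
  ungroup() %>%
  select(-grp)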