I want to identify the most frequent n-grams in a collection of academic papers, including n-grams that contain nested stopwords, but excluding n-grams with leading or trailing stopwords.
I have about 100 PDF files. I converted them to plain-text files with an Adobe batch command and collected them in a single directory. From there I work in R. (The code is a patchwork because I'm just getting started with text mining.)
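(Side note: if the Adobe batch step ever becomes a bottleneck, I gather the conversion could also be done from within R using the pdftools package. A minimal sketch; the "pdf" input directory is a placeholder for wherever the PDFs live, and "txt" matches the directory used below:

library(pdftools)
# Convert each PDF to a plain-text file alongside the ones Adobe produced
pdf_files <- list.files("pdf", pattern = "\\.pdf$", full.names = TRUE)
for (f in pdf_files) {
  pages <- pdf_text(f)  # one character string per page
  writeLines(pages, file.path("txt", sub("\\.pdf$", ".txt", basename(f))))
}

I haven't compared the output quality against the Adobe conversion.)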
My code:
library(tm)
# Make path for sub-dir which contains corpus files
path <- file.path(getwd(), "txt")
# Load corpus files
docs <- Corpus(DirSource(path), readerControl=list(reader=readPlain, language="en"))
# Cleaning (base functions like tolower must be wrapped in
# content_transformer() so tm_map returns a proper corpus
# rather than plain character vectors)
docs <- tm_map(docs, content_transformer(tolower))
docs <- tm_map(docs, removeNumbers)
docs <- tm_map(docs, removePunctuation)
docs <- tm_map(docs, stripWhitespace)  # collapse leftover whitespace last
# Flatten the corpus into a character vector (one element per document)
txt <- sapply(docs, function(d) paste(as.character(d), collapse = " "))
# Find trigrams (but I might look for other n-grams as well);
# recent quanteda versions build n-grams via tokens objects
library(quanteda)
toks <- tokens(txt)
myDfm <- dfm(tokens_ngrams(toks, n = 3))
# Drop infrequent features (quanteda renamed min_count to min_termfreq)
myDfm <- dfm_trim(myDfm, min_termfreq = 5)
# Display top features
topfeatures(myDfm)
#                   as_well_as             of_the_ecosystem
#                          603                          543
#                  in_order_to         a_business_ecosystem
#                          458                          431
#       the_business_ecosystem strategic_management_journal
#                          431                          359
#             in_the_ecosystem        academy_of_management
#                          336                          311
#                  the_role_of                the_number_of
#                          289                          276
For example, among the top n-grams shown above, I'd want to keep "academy_of_management", but not "as_well_as" or "the_role_of". I'd like the code to work for any n-gram size (preferably including n-grams shorter than trigrams, although I understand that in that case it's simpler to just remove stopwords first).
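One direction I've been exploring: since quanteda joins n-gram tokens with "_", a regular expression anchored at the feature boundaries can detect a leading or trailing stopword, and dfm_remove() with valuetype = "regex" can drop those features while leaving nested stopwords alone. A minimal sketch, assuming quanteda's default "_" concatenator and the English list from stopwords():

# Build a regex matching features that begin or end with a stopword:
# "^(stop)(_|$)" catches a leading stopword (or a bare stopword unigram),
# "(^|_)(stop)$" catches a trailing one
stops <- stopwords("en")
stops <- gsub("'", "", stops)  # punctuation was stripped earlier, so "isn't" became "isnt"
boundary_stop_regex <- paste0(
  "^(", paste(stops, collapse = "|"), ")(_|$)",
  "|(^|_)(", paste(stops, collapse = "|"), ")$"
)
# Drop n-grams with boundary stopwords; "academy_of_management" survives,
# "as_well_as" and "the_role_of" do not
myDfm <- dfm_remove(myDfm, pattern = boundary_stop_regex, valuetype = "regex")
topfeatures(myDfm)

Because the alternations also match a bare stopword, the same filter should work for unigrams and bigrams; for other sizes I'd only change the n-gram step, e.g. tokens_ngrams(toks, n = 2:4) to build bigrams through 4-grams in one pass. But I haven't verified this is the idiomatic way to do it, so better approaches are welcome.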