I am trying to count word frequencies across several PDF documents in R with the tm package. The way I am doing it, I have only been able to count each word form independently. I would like to count words taking stems into account: for example, with the keyword "water", I would like "water" and "waters" to be counted together. Here is the script so far.
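To make the goal concrete, here is the behaviour I am after, illustrated with wordStem() from SnowballC (loaded below); both forms should reduce to the same stem:

wordStem(c("water", "waters"))  # expected: "water" "water"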
library(NLP); library(SnowballC); library(tm); library(pdftools)
setwd("C:/Users/Guido/Dropbox/NBSAPs_ed/English")
# To grab the files ending with "pdf"
files <- list.files(pattern = "pdf$")
# pdf_text() extracts the text from each PDF.
NBSAPs <- lapply(files, pdf_text)
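As far as I understand, pdf_text() returns one character string per page, so each element of NBSAPs is a character vector; a quick check on the first document:

length(NBSAPs[[1]])  # number of pages in the first PDF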
# Create a corpus.
NBSAPs_corp <- Corpus(VectorSource(NBSAPs))
# To create the term-document matrix.
NBSAPs_tdm <- TermDocumentMatrix(NBSAPs_corp,
                                 control = list(removePunctuation = TRUE,
                                                tolower = TRUE,
                                                removeNumbers = TRUE))
# To inspect the first 10 rows.
inspect(NBSAPs_tdm[1:10,])
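If stemming is the right approach, I gather tm's control list also accepts a stemming switch (it hands the words to the SnowballC stemmer); a sketch of the stemmed variant, keeping the same cleaning options:

NBSAPs_tdm_stemmed <- TermDocumentMatrix(NBSAPs_corp,
                                         control = list(removePunctuation = TRUE,
                                                        tolower = TRUE,
                                                        removeNumbers = TRUE,
                                                        stemming = TRUE))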
# To convert to a matrix
NBSAPs_table <- as.matrix(NBSAPs_tdm)
# Column names: one per input file
colnames(NBSAPs_table) <- files
# Table for keywords
keywords <- c("water")
final_NBSAPs_table <- NBSAPs_table[keywords, , drop = FALSE]
row.names(final_NBSAPs_table) <- keywords
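With the stemmed matrix sketched above, I assume the keyword lookup would then go through the stem of each keyword rather than the raw word, along these lines:

# Look up stems instead of raw keywords in the stemmed matrix
keywords_stemmed <- wordStem(keywords)  # "water" stems to "water"
NBSAPs_stemmed_table <- as.matrix(NBSAPs_tdm_stemmed)
final_stemmed_table <- NBSAPs_stemmed_table[keywords_stemmed, , drop = FALSE]
colnames(final_stemmed_table) <- files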