
I am trying to count word frequencies across several PDF documents in R with the tm package, but so far I have only been able to count exact word forms independently. I would like to count words together by stem. For example: if I use the keyword "water", I would like "water" and "waters" to be counted as one term. Here is the script so far.

library(NLP); library(SnowballC); library(tm); library(pdftools)

setwd("C:/Users/Guido/Dropbox/NBSAPs_ed/English")

# To grab the files ending with "pdf".
files <- list.files(pattern = "pdf$")

# To extract the text from each file, use pdf_text().
NBSAPs <- lapply(files, pdf_text)
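
Note that pdf_text() returns a character vector with one element per page, so each element of NBSAPs is a vector of pages rather than a single string. If each PDF should enter the corpus as one document, a possible variant (a sketch; NBSAPs_collapsed is just an illustrative name) would be:

# Collapse the pages of each PDF into a single string per document.
NBSAPs_collapsed <- lapply(files, function(f) paste(pdf_text(f), collapse = " "))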

# Create a corpus.
NBSAPs_corp <- Corpus(VectorSource(NBSAPs))

# To create the term-document matrix.
NBSAPs_tdm <- TermDocumentMatrix(NBSAPs_corp, control = list(removePunctuation = TRUE,
                                                             tolower = TRUE,
                                                             removeNumbers = TRUE))

# To inspect the first 10 rows.
inspect(NBSAPs_tdm[1:10,])
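
To check which surface forms of a keyword actually occur before any stemming, tm's Terms() accessor can be used (a quick illustrative check; note that a prefix match also catches unrelated terms such as "watershed", which is one reason plain pattern matching is not enough here):

# List every term in the matrix that starts with "water".
grep("^water", Terms(NBSAPs_tdm), value = TRUE)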

# To convert to a matrix.
NBSAPs_table <- as.matrix(NBSAPs_tdm)


# Column names: one column per source file.
colnames(NBSAPs_table) <- files

# Table for keywords.
# drop = FALSE keeps the result as a matrix even with a single keyword.
keywords <- c("water")
final_NBSAPs_table <- NBSAPs_table[keywords, , drop = FALSE]
row.names(final_NBSAPs_table) <- keywords
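
For counting "water" and "waters" together, the direction I am considering is to build a second, stemmed matrix (a minimal sketch, assuming tm's stemming control option, which stems tokens with SnowballC; the *_stem names are just illustrative):

# With stemming = TRUE, "water" and "waters" are both reduced to the
# stem "water" and therefore counted as a single term.
NBSAPs_tdm_stem <- TermDocumentMatrix(NBSAPs_corp,
                                      control = list(removePunctuation = TRUE,
                                                     tolower = TRUE,
                                                     removeNumbers = TRUE,
                                                     stemming = TRUE))
NBSAPs_stem_table <- as.matrix(NBSAPs_tdm_stem)
colnames(NBSAPs_stem_table) <- files

# Index the stemmed matrix by the stems of the keywords, then relabel
# the rows with the original keywords for readability.
stemmed_keywords <- wordStem(keywords, language = "english")
final_stem_table <- NBSAPs_stem_table[stemmed_keywords, , drop = FALSE]
row.names(final_stem_table) <- keywords

Would this be the right way to combine counts for a keyword and its inflected forms?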
