I am doing some text analysis of comments from bank customers about mortgages, and there are a couple of things I do not understand.
1) After cleaning the data, without applying stemming, I checked the dimensions of the TDM: the number of terms (2173) is smaller than the number of documents (2373). (This is before removing stop words, and the TDM is built on 1-grams.)
2) I also wanted to check the frequency of two-word terms (rowSums(Matrix)) after tokenizing the TDM into bi-grams. The issue is that the most frequent result is, for example, the bi-gram "Proble miss". Since this pairing already looked strange, I went back to the dataset and searched for it with Ctrl+F, but I could not find it. Questions: it seems that the code has somehow stemmed these words; how is that possible? (Of the top 25 bi-grams, this is the only one that appears to be stemmed.) Isn't the tokenizer supposed to create bi-grams ONLY from words that actually appear next to each other?
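One thing that may be relevant to the second point: the bi-grams are built from the corpus after cleaning and stop-word removal, not from the raw comments, so two words can become neighbours once the stop words between them are dropped. Below is a minimal base-R sketch of that effect. The sentences, the stop list, and the bi-gram "problem miss" are made up for illustration, and the `gsub` calls only approximate what the tm functions do.

```r
# Made-up comments and stop list, for illustration only (not the real data).
raw <- c("The problem is that I miss my statements.",
         "Quick and easy. High value.")
stops <- c("the", "is", "that", "i", "my", "and")

# Roughly mimic the cleaning steps: lower-case, drop punctuation,
# remove stop words, then collapse the leftover whitespace.
cleaned <- tolower(raw)
cleaned <- gsub("[[:punct:]]", "", cleaned)
stop_pattern <- paste0("\\b(", paste(stops, collapse = "|"), ")\\b")
cleaned <- gsub(stop_pattern, "", cleaned, perl = TRUE)
cleaned <- trimws(gsub("\\s+", " ", cleaned))

# Search for the same bi-gram in the raw text and in the cleaned text.
bigram <- "problem miss"
in_raw <- grepl(bigram, tolower(raw), fixed = TRUE)
in_cleaned <- grepl(bigram, cleaned, fixed = TRUE)

print(cleaned)
print(in_raw)      # the bi-gram is not contiguous in the raw text
print(in_cleaned)  # but it is contiguous after stop-word removal
```

On the real data, a comparable check would be to search the cleaned vector instead of the original file, for example `grep("proble\\s+miss", file_cleaned_stopped)`. This would only show where the pair comes from; it does not by itself explain why "proble" is truncated.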
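{library(qdap)   # assumed source of replace_number(), replace_abbreviation(), replace_contraction()
library(tm)     # removePunctuation(), stripWhitespace(), removeWords(), VCorpus(), TermDocumentMatrix()
library(RWeka)  # NGramTokenizer(), Weka_control()
file_cleaning <- replace_number(files$VERBATIM)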
file_cleaning <- replace_abbreviation(file_cleaning)
file_cleaning <- replace_contraction(file_cleaning)
file_cleaning <- tolower(file_cleaning)
file_cleaning <- removePunctuation(file_cleaning)
file_cleaning[467]
file_cleaned <- stripWhitespace(file_cleaning)
custom_stops <- c("bank") # lower case, because the text has already been passed through tolower()
file_cleaning_stops <- c(custom_stops, stopwords("en"))
file_cleaned_stopped<- removeWords(file_cleaning,file_cleaning_stops)
file_cleaned_corups<- VCorpus(VectorSource(file_cleaned))
file_cleaned_tdm <-TermDocumentMatrix(file_cleaned_corups)
dim(file_cleaned_tdm) # Number of terms <number of documents
file_cleaned_mx <- as.matrix(file_cleaned_tdm)
file_cleaned_corups<- VCorpus(VectorSource(file_cleaned_stopped))
file_cleaned_tdm <-TermDocumentMatrix(file_cleaned_corups)
file_cleaned_mx <- as.matrix(file_cleaned_tdm)
dim(file_cleaned_mx)
file_cleaned_mx[220:225, 475:478]
# coffee_m <- as.matrix(coffee_tdm) # commented out: coffee_tdm is not defined anywhere in this script
term_frequency <- rowSums(file_cleaned_mx)
term_frequency <- sort(term_frequency, decreasing = TRUE)
term_frequency[1:10]
BigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))
bigram_dtm <- TermDocumentMatrix(file_cleaned_corups, control = list(tokenize = BigramTokenizer))
dim(bigram_dtm)
bigram_bi_mx <- as.matrix(bigram_dtm)
term_frequency <- rowSums(bigram_bi_mx)
term_frequency <- sort(term_frequency, decreasing = TRUE)
term_frequency[1:15]
freq_bigrams <- findFreqTerms(bigram_dtm, 25)
freq_bigrams}
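On the first point (fewer terms than documents in `dim(file_cleaned_tdm)`), here is a small base-R sketch of how short comments that share a small vocabulary can produce that shape. The comments below are made up, and splitting on spaces is only a stand-in for tm's tokenizer. The length filter imitates what I understand to be the default `wordLengths = c(3, Inf)` control in `TermDocumentMatrix`, which drops terms shorter than three characters.

```r
# Made-up short comments, for illustration only (not the real data).
docs <- c("quick and easy", "very quick", "easy and quick", "not easy",
          "quick", "it is ok", "very easy")

# Split on spaces as a simple stand-in for the tokenizer.
tokens <- unlist(strsplit(docs, " ", fixed = TRUE))

# Unique terms before and after dropping terms shorter than 3 characters.
all_terms <- unique(tokens)
kept_terms <- all_terms[nchar(all_terms) >= 3]

n_docs <- length(docs)
print(c(documents = n_docs,
        all_terms = length(all_terms),
        kept_terms = length(kept_terms)))
```

In this toy case there are more unique tokens than documents before the length filter and fewer after it, so the number of rows in a TDM can be smaller than the number of columns without anything having been stemmed.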
Sample of the dataset:
> dput(droplevels(head(files,4)))
structure(list(Score = c(10L, 10L, 10L, 7L), Comments = structure(c(4L,
3L, 1L, 2L), .Label = c("They are nice an quick. 3 years with them, and no issue.",
"Staff not very friendly.",
"I have to called them 3 times. They are very slow.",
"Quick and easy. High value."
), class = "factor")), row.names = c(NA, 4L), class = "data.frame")