
I am doing some text analysis of comments from bank customers related to mortgages, and I have found a couple of things I do not understand.

1) After cleaning the data without applying stemming, and checking the dimensions of the TDM, the number of terms (2173) is smaller than the number of documents (2373). (This is before removing stop words, and the TDM is built on 1-grams.)

2) I also wanted to check the bigram frequencies (`rowSums(Matrix)`) of the bigram-tokenized TDM. The issue is that the most frequent result I get is the word pair "Proble miss". Since this grouping already looked strange, I went back to the dataset and searched for it with Ctrl+F, but I could not find it. Questions: it seems the code has somehow stemmed these words; how is that possible? (Of the top 25 bigrams, this is the only one that looks stemmed.) Isn't this supposed to ONLY create bigrams from words that actually appear together?

library(qdap)   # replace_number, replace_abbreviation, replace_contraction
library(tm)     # VCorpus, TermDocumentMatrix, removeWords, stopwords
library(RWeka)  # NGramTokenizer, Weka_control

file_cleaning <- replace_number(files$VERBATIM)
file_cleaning <- replace_abbreviation(file_cleaning)
file_cleaning <- replace_contraction(file_cleaning)
file_cleaning <- tolower(file_cleaning)
file_cleaning <- removePunctuation(file_cleaning)
file_cleaned  <- stripWhitespace(file_cleaning)

custom_stops <- c("bank")  # must be lowercase: text was lowercased above
file_cleaning_stops <- c(custom_stops, stopwords("en"))
file_cleaned_stopped <- removeWords(file_cleaned, file_cleaning_stops)

# TDM before removing stop words
file_cleaned_corpus <- VCorpus(VectorSource(file_cleaned))
file_cleaned_tdm <- TermDocumentMatrix(file_cleaned_corpus)
dim(file_cleaned_tdm)  # number of terms < number of documents
file_cleaned_mx <- as.matrix(file_cleaned_tdm)

# TDM after removing stop words
file_cleaned_corpus <- VCorpus(VectorSource(file_cleaned_stopped))
file_cleaned_tdm <- TermDocumentMatrix(file_cleaned_corpus)
file_cleaned_mx <- as.matrix(file_cleaned_tdm)

dim(file_cleaned_mx)
file_cleaned_mx[220:225, 475:478]

# 1-gram term frequencies
term_frequency <- rowSums(file_cleaned_mx)
term_frequency <- sort(term_frequency, decreasing = TRUE)
term_frequency[1:10]

# bigram-tokenized TDM
BigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))
bigram_tdm <- TermDocumentMatrix(file_cleaned_corpus, control = list(tokenize = BigramTokenizer))
dim(bigram_tdm)

# bigram frequencies
bigram_bi_mx <- as.matrix(bigram_tdm)
term_frequency <- rowSums(bigram_bi_mx)
term_frequency <- sort(term_frequency, decreasing = TRUE)
term_frequency[1:15]

freq_bigrams <- findFreqTerms(bigram_tdm, 25)
freq_bigrams

SAMPLE of DATASET:

> dput(droplevels(head(files,4)))

structure(list(Score = c(10L, 10L, 10L, 7L), Comments = structure(c(4L,
3L, 1L, 2L), .Label = c("They are nice an quick. 3 years with them, and no issue.",
"Staff not very friendly.",
"I have to called them 3 times. They are very slow.",
"Quick and easy. High value."
), class = "factor")), row.names = c(NA, 4L), class = "data.frame")
Robbie
    Welcome to SO! This community has a few [rules](https://stackoverflow.com/help/on-topic) and [norms](https://stackoverflow.com/help/how-to-ask) and following them will help you get a good answer to your question. In particular, it’s best to provide an [MCVE](https://stackoverflow.com/help/mcve) (a minimum, complete, and verifiable example). Good advice for R-specific MVCEs is available [here](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example/5963610#5963610) and [here](https://reprex.tidyverse.org/articles/reprex-dos-and-donts.html). – DanY Sep 10 '18 at 04:00
  • Specifically, what do your data look like? (Use `dput(head(my_df))` to get your data out of R and onto SO.) Where are your non-Base-R functions defined? (e.g., where do I find `replace_abbreviation`? Is it in a package?) And finally, you should zero-in on the part of the code causing you trouble -- 42 lines of code is too much for us to dig through. Thanks and good luck! – DanY Sep 10 '18 at 04:05
  • Thanks @DanY!! Sorry, I do not comment often, so not much practice! I have added what a sample would look like. Regarding your question about functions, they are mainly from the tm package. – Robbie Sep 10 '18 at 20:57

1 Answer


Q1: There are situations where you can end up with fewer terms than documents.

First, you are using VectorSource: the number of documents is the number of elements in the vector you pass in, which is not necessarily the number of real documents. An element containing only a space still counts as a document. Second, you are removing stopwords; if there are many of these in your text, a lot of words will disappear. Finally, TermDocumentMatrix by default drops all words shorter than 3 characters, so any short words left after stopword removal are dropped as well. You can adjust this with the wordLengths option when creating a TermDocumentMatrix / DocumentTermMatrix.

# wordlengths starting at length 1, default is 3
TermDocumentMatrix(corpus, control=list(wordLengths=c(1, Inf)))
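
As a minimal sketch of the VectorSource point: a vector element containing only a space still counts as a document but contributes no terms, so you can end up with more documents than terms:

# two "documents": one is just a space, one contains a real word
text <- c(" ", "text")
my_corp <- VCorpus(VectorSource(text))
my_dtm  <- DocumentTermMatrix(my_corp)
inspect(my_dtm)  # 2 documents, but only 1 term ("text")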

Q2: Without a sample document this is a bit of a guess.

It is likely a combination of the functions replace_number, replace_contraction, replace_abbreviation, removePunctuation and stripWhitespace, which can produce a word that you can't find quickly with a plain text search. Your best bet is to look for every word starting with "prob": "proble" is, as far as I can see, not a correct stem. Also, qdap and tm do not do any stemming unless you explicitly ask for it.
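
A quick way to do that search in R itself instead of Ctrl+F in the raw file (assuming the file_cleaned object from the question's code):

# show every cleaned comment containing a token starting with "prob"
grep("\\bprob", file_cleaned, value = TRUE)

# and every comment containing "miss" (e.g. "missed", "missing")
grep("miss", file_cleaned, value = TRUE)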

You also have a mistake in your custom_stops. All stopwords are lowercase, and you lowercase your text before removing them, so your custom stopwords must be lowercase too: "bank" instead of "Bank".
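
removeWords is case-sensitive, so the mismatch is easy to demonstrate:

removeWords(tolower("The Bank was slow"), c("Bank"))  # "bank" survives: no match
removeWords(tolower("The Bank was slow"), c("bank"))  # "bank" is removed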

phiver
  • Thanks @phiver! Q1: My limited understanding was that 1 vector equals 1 document. What do you mean by "A vector with a space in it would count as a document"? Q2: Just to clarify, I was not doing any stemming. If this helps clarify more, I have done Ctrl+F with "proble " and that is not in the dataset. That is why I find it strange that "proble miss" came up. Good call about "Bank" :) – Robbie Sep 10 '18 at 21:28
  • @Robbie, try this code `text <- c(" ", "text"); my_corp <- Corpus(VectorSource(text)); my_dtm <- DocumentTermMatrix(my_corp); inspect(my_dtm)`. You can see that 2 documents are created, but only "text" will be shown in the dtm, the vector with " " has been removed. (adjusting wordLength wouldn't change the result). So resulting in more documents than terms. @Q2, try looking for "miss" in your text. – phiver Sep 11 '18 at 08:30
  • Hey @phiver, sorry, what would "text" be? The whole column with all the comments? Q2: I have looked up just the word "miss" but could not find it as such, only as "missed", for example... Thanks! – Robbie Oct 21 '18 at 22:43
  • @Robbie, in the example in the comment "text" is just the word text. Nothing more. – phiver Oct 22 '18 at 12:38