
I'm getting started with the tm package in R, so please bear with me and apologies for the big ol' wall of text. I have created a fairly large corpus of Socialist/Communist propaganda and would like to extract newly coined political terms (multiple words, e.g. "struggle-criticism-transformation movement").

This is a two-step question, one regarding my code so far and one regarding how I should go on.

Step 1: To do this, I wanted to identify some common ngrams first. But I get stuck very early on. Here is what I've been doing:

library(tm)
library(RWeka)

a <- Corpus(DirSource("/mycorpora/1965"), readerControl = list(language="lat")) # that dir is full of txt files
summary(a)  
a <- tm_map(a, removeNumbers)
a <- tm_map(a, removePunctuation)
a <- tm_map(a, stripWhitespace)
a <- tm_map(a, tolower)
a <- tm_map(a, removeWords, stopwords("english")) 
a <- tm_map(a, stemDocument, language = "english") 
# everything works fine so far, so I start playing around with what I have
adtm <- DocumentTermMatrix(a) 
adtm <- removeSparseTerms(adtm, 0.75)

inspect(adtm) 

findFreqTerms(adtm, lowfreq=10) # find terms that occur at least 10 times

findAssocs(adtm, "usa",.5) # just looking for some associations  
findAssocs(adtm, "china",.5)

# ... and so on, and so forth, all of this works fine

The corpus I load into R works fine with most functions I throw at it. I haven't had any problems creating TDMs from my corpus, finding frequent words, associations, creating word clouds and so on. But when I try to identify ngrams using the approach outlined in the tm FAQ, I'm apparently making some mistake with the TDM constructor:

# Trigram

TrigramTokenizer <- function(x) NGramTokenizer(x, 
                                Weka_control(min = 3, max = 3))

tdm <- TermDocumentMatrix(a, control = list(tokenize = TrigramTokenizer))

inspect(tdm)

I get this error message:

Error in rep(seq_along(x), sapply(tflist, length)) : 
invalid 'times' argument
In addition: Warning message:
In is.na(x) : is.na() applied to non-(list or vector) of type 'NULL'

Any ideas? Is `a` not the right class/object? I'm confused. I assume there's a fundamental mistake here, but I'm not seeing it. :(
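
One thing I still plan to try, in case it narrows things down, is running the tokenizer directly on a single document, outside of TermDocumentMatrix(). This is only a sketch and assumes a[[1]] can be coerced to a plain character vector:

# Sketch: test the tokenizer on one document, bypassing the TDM constructor
txt <- as.character(a[[1]])
head(NGramTokenizer(txt, Weka_control(min = 3, max = 3)), 10)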

Step 2: Then I would like to identify ngrams that are significantly overrepresented when I compare my corpus against other corpora. For example, I could compare my corpus against a large standard English corpus. Or I could create subsets to compare against each other (e.g. Soviet vs. Chinese Communist terminology). Do you have any suggestions for how I should go about doing this? Any scripts/functions I should look into? Just some ideas or pointers would be great.
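
For what it's worth, the rough direction I've been thinking about for the comparison is a plain log-likelihood (G2) keyness score on term counts from two document-term matrices. This is entirely untested, and reference_dtm below is a hypothetical DTM built from whatever reference corpus I end up using:

# Sketch: rank terms (or ngrams) by log-likelihood (G^2) keyness,
# comparing my corpus (adtm) against a hypothetical reference corpus DTM.
library(slam)                      # sparse matrix backend that tm uses
freq_a <- col_sums(adtm)           # term counts in my corpus
freq_b <- col_sums(reference_dtm)  # term counts in the reference corpus (hypothetical)

keyness <- function(freq_a, freq_b) {
  terms <- union(names(freq_a), names(freq_b))
  a <- freq_a[terms]; a[is.na(a)] <- 0; names(a) <- terms
  b <- freq_b[terms]; b[is.na(b)] <- 0; names(b) <- terms
  n_a <- sum(a); n_b <- sum(b)
  # expected counts if both corpora used each term at the same rate
  e_a <- n_a * (a + b) / (n_a + n_b)
  e_b <- n_b * (a + b) / (n_a + n_b)
  2 * (ifelse(a > 0, a * log(a / e_a), 0) + ifelse(b > 0, b * log(b / e_b), 0))
}

head(sort(keyness(freq_a, freq_b), decreasing = TRUE), 25)  # most distinctive terms

But I don't know whether that is the usual way to do it, hence the question.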

Thanks for your patience!

Markus D
  • I had the same error, for me it worked when I set min different from max in Weka control... Don't know if this is an option for you... – holzben Oct 27 '13 at 08:48
  • Thanks for your advice! Didn't work for me, though. The error message remains the same when I change the min/max values. – Markus D Oct 27 '13 at 09:59
  • Just in case people ever find this or are interested: I have not actually solved the first problem, but did manage to work around it by using a similar function provided by the **RTextTools** package: `matrix <- create_matrix(corpus,ngramLength=3)` – Markus D Oct 28 '13 at 14:43
  • Can you share some of your data (on a free temporary file hosting site, perhaps), that will help with reproducing your problem and finding solutions. – Ben Oct 29 '13 at 03:46
  • Thank you. Yes, I have uploaded a corpus sample here: http://s000.tinyupload.com/index.php?file_id=46554569218218543610 – Markus D Oct 29 '13 at 06:18
  • How would this be done with unstructured binary data? Say, on binary patterns within an EXE or PDF file, without decoding or analyzing the file format's structure? – Richard Żak Jan 21 '14 at 18:45
  • Just set the number of available cores to 1: `options(mc.cores=1)` – marbel Oct 27 '15 at 17:42

4 Answers


I could not reproduce your problem. Are you using the latest versions of R, tm, RWeka, etc.?

require(tm)
a <- Corpus(DirSource("C:\\Downloads\\Only1965\\Only1965"))
summary(a)  
a <- tm_map(a, removeNumbers)
a <- tm_map(a, removePunctuation)
a <- tm_map(a, stripWhitespace)
a <- tm_map(a, tolower)
a <- tm_map(a, removeWords, stopwords("english")) 
# a <- tm_map(a, stemDocument, language = "english") 
# I also got it to work with stemming, but it takes so long...
adtm <- DocumentTermMatrix(a) 
adtm <- removeSparseTerms(adtm, 0.75)

inspect(adtm) 

findFreqTerms(adtm, lowfreq=10) # find terms that occur at least 10 times
findAssocs(adtm, "usa",.5) # just looking for some associations  
findAssocs(adtm, "china",.5)

# Trigrams
require(RWeka)
TrigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 3, max = 3))
tdm <- TermDocumentMatrix(a, control = list(tokenize = TrigramTokenizer))
tdm <- removeSparseTerms(tdm, 0.75)
inspect(tdm[1:5,1:5])

And here's what I get:

A term-document matrix (5 terms, 5 documents)

Non-/sparse entries: 11/14
Sparsity           : 56%
Maximal term length: 28 
Weighting          : term frequency (tf)

                                   Docs
Terms                               PR1965-01.txt PR1965-02.txt PR1965-03.txt
  †chinese press                              0             0             0
  †renmin ribao                               0             1             1
  — renmin ribao                              2             5             2
  “ chinese people                            0             0             0
  “renmin ribaoâ€\u009d editorial             0             1             0
  etc. 
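
As a side note, if it's useful: a quick way to rank the trigrams by overall frequency is to sum the rows of the term-document matrix, e.g. with the slam package that tm builds on (a sketch, not needed for the above):

require(slam)
trigram_freq <- sort(row_sums(tdm), decreasing = TRUE)  # terms are rows in a TDM
head(trigram_freq, 20)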

Regarding your step two, here are some pointers to useful starting points:

http://quantifyingmemory.blogspot.com/2013/02/mapping-significant-textual-differences.html
http://tedunderwood.com/2012/08/14/where-to-start-with-text-mining/
and here's his code: https://dl.dropboxusercontent.com/u/4713959/Neuchatel/NassrProgram.R

Ben
  • Thank you again, Ben. I checked my R, RWeka and tm versions and everything seems to be up to date. This error was apparently discussed before (http://stackoverflow.com/questions/17703553/) and you had weighed in that it might have something to do with the Java installation. I tried running the code on a Windows machine and everything went smoothly, so I'm guessing that was the issue. As for Step 2, Ted Underwood's Nassr script appears to do pretty much what I'm looking for, only with words instead of ngrams. I will try to decipher it and learn from it! Thanks! – Markus D Oct 31 '13 at 07:18
  • No worries. Yes, Java... all I remember about that is that it's the source of a lot of frustration! Glad to hear you've got a few options for getting past that hurdle. Curious to see how your n-grams overrepresentation analysis goes, do post another question on that when you've got some code working. – Ben Oct 31 '13 at 07:33

Regarding Step 1, Brian.keng gives a one-liner workaround here https://stackoverflow.com/a/20251039/3107920 that solves this issue on Mac OS X - it seems to be related to parallelisation rather than (the minor nightmare that is) Java setup on the Mac.
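
In case the link goes stale: the gist of that workaround (also mentioned in the comments above) is to switch off parallel processing before building the matrix, roughly like this (using a and TrigramTokenizer from the question):

options(mc.cores=1)  # use a single core, which avoids the mclapply-related failure
tdm <- TermDocumentMatrix(a, control = list(tokenize = TrigramTokenizer))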

cenau

You may want to access the functions explicitly via their package namespace, like this:

BigramTokenizer  <- function(x) {
    RWeka::NGramTokenizer(x, RWeka::Weka_control(min = 2, max = 3))
}

myTdmBi.d <- TermDocumentMatrix(
    myCorpus.d,
    control = list(tokenize = BigramTokenizer, weighting = weightTfIdf)
)

Also, a couple of other things that came up along the way:

myCorpus.d <- tm_map(myCorpus.d, tolower)  # This does not work anymore 

Try this instead:

 myCorpus.d <- tm_map(myCorpus.d, content_transformer(tolower))  # Make lowercase

Also, in the RTextTools package, the ngramLength argument threw an error message for me:

create_matrix(as.vector(C$V2), ngramLength=3)

Ashish M

Further to Ben's answer - I couldn't reproduce this either, but in the past I've had trouble with the plyr package and conflicting dependencies. In my case there was a conflict between Hmisc and ddply. You could try adding this line just prior to the offending line of code:

tryCatch(detach("package:Hmisc"), error = function(e) NULL)

Apologies if this is completely tangential to your problem!

Rolf Fredheim