Fetching the data from the URL:
suppressMessages(library(readr))
suppressMessages(library(RCurl))
amazon_url <- getURL('http://s3.amazonaws.com/assets.datacamp.com/production/course_935/datasets/500_amzn.csv',
                     ssl.verifyhost = FALSE, ssl.verifypeer = FALSE)
amazon <- read.csv(textConnection(amazon_url), header = TRUE)
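A quick check that the file was parsed into a data frame with the expected columns:
str(amazon)  # structure of the imported data frame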
Create amazon_cons:
amazon_cons <- amazon$cons
Build cleaning function based on the qdap package for text organization:
suppressWarnings(library(qdap))
qdap_clean <- function(x) {
  x <- replace_abbreviation(x)
  x <- replace_contraction(x)
  x <- replace_number(x)
  x <- replace_ordinal(x)
  x <- replace_symbol(x)
  x <- tolower(x)
  return(x)
}
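As a sanity check, the cleaner can be tried on a single made-up string (the sentence below is only an illustration):
qdap_clean("I'd say the 2nd shift isn't bad & pays $15")  # expands contractions, ordinals and symbols, then lowercases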
Build cleaning function based on the tm package for text organization:
suppressWarnings(library(tm))
tm_clean <- function(corpus) {
  corpus <- tm_map(corpus, removePunctuation)
  corpus <- tm_map(corpus, stripWhitespace)
  corpus <- tm_map(corpus, removeWords,
                   c(stopwords("en"), "Amazon", "company"))
  return(corpus)
}
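For illustration, the same function applied to a tiny one-document corpus (the sample sentence is made up):
sample_corp <- VCorpus(VectorSource("Amazon is a great company, but the pay is low."))
sample_corp <- tm_clean(sample_corp)
content(sample_corp[[1]])  # cleaned text of the single document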
Word cleaning:
amzn_cons <- qdap_clean(amazon_cons)
amzn_cons <- VCorpus(VectorSource(amzn_cons))
amzn_cons_corp <- tm_clean(amzn_cons)
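To see what the cleaned corpus looks like before building the term-document matrix:
amzn_cons_corp                # corpus summary
content(amzn_cons_corp[[1]])  # text of the first cleaned document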
Build custom function to extract bigram features:
suppressWarnings(library(RWeka))
tokenizer <- function(x)
  NGramTokenizer(x, Weka_control(min = 2, max = 2))
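Called directly on a plain character string, the tokenizer can be checked like this (the sentence is made up):
NGramTokenizer("slow career growth and long hours", Weka_control(min = 2, max = 2))  # should return the bigrams of the sentence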
Apply the tokenization function to get bigrams:
amzn_c_tdm <- TermDocumentMatrix(
  amzn_cons_corp,
  control = list(tokenize = tokenizer)
)
This results in the following error:
Error in .jcall("RWekaInterfaces", "[S", "tokenize", .jcast(tokenizer, :
java.lang.NullPointerException
How can I solve this error?