I am working on a text mining project in R. The file is over 100 MB. I managed to read the file and do some text processing; however, when I get to the point of removing stop words, RStudio crashes. What would be the best solution, please?
Should I split the file into 2 or 3 files, process them, and then merge them again before applying any analytics? Does anyone have code to split it? I tried several options available online and none of them seems to work.
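To make the question concrete, this is the kind of split/process/merge I have in mind, in base R (a rough, untested sketch; txt stands for the character vector of documents, n_chunks is arbitrary, and the processing step is a placeholder):

# Split the documents into a few chunks, process each, then merge
n_chunks <- 3
chunk_id <- cut(seq_along(txt), breaks = n_chunks, labels = FALSE)
chunks <- split(txt, chunk_id)
processed <- lapply(chunks, function(x) x)  # placeholder for the real cleaning
txt_merged <- unlist(processed, use.names = FALSE)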
Here is the code I used. Everything worked smoothly except for removing the stop words:
# Install
install.packages("tm") # for text mining
install.packages("SnowballC") # for text stemming
install.packages("wordcloud") # word-cloud generator
install.packages("RColorBrewer") # color palettes
# Load
library("tm")
library("SnowballC")
library("wordcloud")
library("RColorBrewer")
library("readr")
# Read the file and build a corpus
# (note: VectorSource expects a character vector; passing a whole
# data frame makes each column one document)
doc <- read_csv(file.choose())
docs <- Corpus(VectorSource(doc))
# Inspect the corpus
docs
# Convert the text to lower case
docs <- tm_map(docs, content_transformer(tolower))
# Remove numbers
docs <- tm_map(docs, removeNumbers)
# Remove common English stop words
docs <- tm_map(docs, removeWords, stopwords("english"))
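If it helps, here is the chunked version of the failing step that I am considering, instead of one tm_map call over the whole corpus (an untested sketch; doc$text stands for whichever column holds the text, and the chunk size of 10000 is a guess):

# Untested sketch: clean the text in chunks, then rebuild one corpus
clean_chunk <- function(txt) {
  ch <- Corpus(VectorSource(txt))
  ch <- tm_map(ch, content_transformer(tolower))
  ch <- tm_map(ch, removeNumbers)
  ch <- tm_map(ch, removeWords, stopwords("english"))
  sapply(ch, as.character)  # back to a plain character vector
}
chunk_id <- ceiling(seq_along(doc$text) / 10000)
cleaned <- unlist(lapply(split(doc$text, chunk_id), clean_chunk), use.names = FALSE)
docs <- Corpus(VectorSource(cleaned))

Would batching like this actually avoid the crash, or is there a better way (for example, a more memory-efficient package)?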