
With the help of the tidytext package, I'm trying to count all bigrams and trigrams in a personal dataset. However, this dataset has over 1 million rows (paragraphs, really), with many words in each one. This is a memory-intensive process that sometimes crashes RStudio.

I have already tried the sparklyr package for this text-mining task (https://spark.rstudio.com/guides/textmining/), but its functions aren't well documented and the tokenization process is quite cumbersome; for example, it isn't easy to remove certain punctuation from the text, something tidytext handles much better.

I was wondering, however, whether there is a way to use parallel computing, through the parallel package, the foreach package, or any other package, to "divide" the task into different parts and assign each part to one of my cores for more efficiency. Consider the example below: I would have no problem splitting the unnest_tokens() step into, say, 10 parts and then rbind()-ing the results together at the end to work with the final data frame (I sketch the idea right after the example).

library(dplyr)
library(tidytext)
library(janeaustenr)

austen_bigrams <- austen_books() %>%
  unnest_tokens(bigram, text, token = "ngrams", n = 2) %>% 
  count(bigram, sort = TRUE)
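
To make the idea concrete, here is a rough sketch of the kind of splitting I have in mind, using mclapply() from the parallel package (I'm not sure this is the idiomatic way to do it, the 10 chunks and 4 cores are arbitrary choices, and since mclapply() relies on forking it won't parallelize like this on Windows):

library(parallel)
library(dplyr)
library(tidytext)
library(janeaustenr)

books <- austen_books()

# split the rows into 10 roughly equal chunks
chunks <- split(books, cut(seq_len(nrow(books)), breaks = 10, labels = FALSE))

# tokenize each chunk on a separate core, then recombine and count
austen_bigrams <- mclapply(chunks, function(chunk) {
  unnest_tokens(chunk, bigram, text, token = "ngrams", n = 2)
}, mc.cores = 4) %>%
  bind_rows() %>%
  count(bigram, sort = TRUE)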

Is there a recommended way to achieve this? How can I improve the efficiency? I'm not willing to use a subsample of my dataset; I really want to use all of my data for this personal exercise.

Also, in this related question, "Does anyone know how I can work with big data in R?", Julia Silge, one of the creators of the tidytext package, recommended using the vroom package to

...read in the data, and work with chunks of the data at a time (starting with, say, 50k lines and then seeing how much you can scale up to do at once).

and then:

append this to a new CSV of aggregated results. Then work through your whole dataset in chunks. At the end, you can parse your CSV of results and re-aggregate your counts to sum up and find the hashtag frequencies.

If the parallel-computing idea is impossible or not useful in this case, how could I achieve what she describes with the help of the vroom package? Its role here isn't clear to me (my rough interpretation follows below).
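
For reference, this is roughly how I interpret her suggestion (I'm not sure it is the intended use of vroom; the file names, the text column, and the 50,000-row chunk size are placeholders for my real data):

library(vroom)
library(readr)
library(dplyr)
library(tidytext)

dataset <- vroom("my_text_data.csv")   # vroom reads lazily, so this step should be cheap

chunk_size <- 50000
n_chunks   <- ceiling(nrow(dataset) / chunk_size)

for (i in seq_len(n_chunks)) {
  rows <- ((i - 1) * chunk_size + 1):min(i * chunk_size, nrow(dataset))

  # count bigrams within this chunk only
  chunk_counts <- dataset[rows, ] %>%
    unnest_tokens(bigram, text, token = "ngrams", n = 2) %>%
    count(bigram)

  # append each chunk's aggregated counts to a results file
  write_csv(chunk_counts, "bigram_counts.csv", append = (i > 1))
}

# re-aggregate the per-chunk counts to get the overall frequencies
final_counts <- read_csv("bigram_counts.csv") %>%
  group_by(bigram) %>%
  summarise(n = sum(n)) %>%
  arrange(desc(n))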

Bear in mind that I would like to use the final data frame for several purposes, one of which is a small top-10 or top-20 plot (the 10 or 20 most common words/bigrams/trigrams after removing stop words).

For more information, here's my number of cores:

library(parallel)
detectCores()
[1] 8
detectCores(logical = FALSE)
[1] 4

Thanks! I'm really new to big data and parallel computing in R. Any type of help would be greatly appreciated.

caproki
  • Try using the quanteda framework. It runs in parallel from the start (2 threads by default); with `quanteda_options` you can set the number of threads to use. – phiver Apr 05 '21 at 09:18
  • @phiver Thanks for your suggestion. I'm already trying quanteda with the maximum of 8 threads for my CPU. However, the `tokens()` function takes forever, and the same happens with `dfm()`. Is quanteda still a viable option? Should I give up on it and keep trying my luck with Spark, for example? This is truly a big data problem... – caproki Apr 06 '21 at 23:55
  • Is the problem in reading in the data or is the problem in processing? If the latter, you might want to fire up an AWS or Azure instance and process it there. If the former check [this SO post](https://stackoverflow.com/questions/60928866/read-a-20gb-file-in-chunks-without-exceeding-my-ram-r) for more info. – phiver Apr 07 '21 at 08:50
  • Maybe it's relevant for you that you can combine `dfm` objects in quanteda using `rbind`. An option would be to spread tokenization across multiple machines and combine the results afterwards. `dfm`s are highly compressed representations of texts, so they should be easier to work with. I worked with 2 million paragraphs on my laptop before (32GB RAM) and it would take a bit but ultimately work fine. So maybe you just need some patience? – JBGruber May 06 '21 at 15:17

1 Answer


Here is an example of creating bigrams from a data frame using foreach and doParallel (this assumes you have more cores available than in the question; adjust the number of workers and chunks to your machine).

library(foreach)
library(doParallel)
library(itertools)   # provides isplitRows()
library(tidytext)
library(janeaustenr)

# 25 workers here; set this to the number of cores you actually have
cl <- makeCluster(25, outfile = "")
registerDoParallel(cl)

dataset <- austen_books()

# split the rows into chunks, tokenize each chunk on a worker, then rbind the results
austen_bigrams <- foreach(m = isplitRows(dataset, chunks = 25),
                          .combine = 'rbind',
                          .packages = 'tidytext') %dopar% {
  unnest_tokens(m, ngrams, text, token = "ngrams", n = 2)
}

stopCluster(cl)
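
From there, the combined data frame can be aggregated the same way as in the question's sequential example, for instance:

library(dplyr)

austen_bigrams %>%
  count(ngrams, sort = TRUE)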

generic