I'm trying to count all bigrams and trigrams in a personal dataset with the help of the tidytext package. However, the dataset has over 1 million lines (paragraphs, really), each with many words, so the process is memory-intensive and sometimes crashes RStudio.
I have already tried the sparklyr package for this text-mining task (https://spark.rstudio.com/guides/textmining/), but its functions aren't well documented and the tokenization process is cumbersome (for example, it isn't easy to remove certain punctuation from the text, whereas tidytext handles this better).
I was therefore wondering whether there is a way to use parallel computing, through the parallel package, the foreach package, or any other package, to "divide" the task into parts and assign each one to one of my cores for better efficiency. Take the example below: I have no problem splitting the unnest_tokens() step into, say, 10 parts and then rbind()-ing the results together at the end to work with the final data frame.
library(dplyr)
library(tidytext)
library(janeaustenr)
austen_bigrams <- austen_books() %>%
  unnest_tokens(bigram, text, token = "ngrams", n = 2) %>%  # tokenize into bigrams
  count(bigram, sort = TRUE)                                # count each bigram
Is there any way to achieve this? How can I improve the efficiency? I'm not willing to use a subsample of my dataset; I really want to use all of my data for this exercise.
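Here is a rough sketch of the parallel version I have in mind, using foreach and doParallel (my_data is a placeholder for my real data frame, which has a text column; I don't know if this is idiomatic, or whether copying the data to each worker defeats the purpose memory-wise):
library(dplyr)
library(tidytext)
library(foreach)
library(doParallel)

cl <- makeCluster(4)  # my 4 physical cores
registerDoParallel(cl)

# split the rows into 10 roughly equal chunks
chunk_id <- cut(seq_len(nrow(my_data)), breaks = 10, labels = FALSE)

bigram_counts <- foreach(i = 1:10, .combine = bind_rows,
                         .packages = c("dplyr", "tidytext")) %dopar% {
  my_data[chunk_id == i, ] %>%
    unnest_tokens(bigram, text, token = "ngrams", n = 2) %>%
    count(bigram)
}

stopCluster(cl)

# re-aggregate the partial counts from all chunks
bigram_counts <- bigram_counts %>%
  group_by(bigram) %>%
  summarise(n = sum(n)) %>%
  arrange(desc(n))
My worry is that foreach will export the whole my_data object to every worker, which might make the memory problem worse rather than better.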
Also, in this related question, "Does anyone know how I can work with big data in R?", Julia Silge, one of the authors of the tidytext package, recommended using the vroom package to
...read in the data, and work with chunks of the data at a time (starting with, say, 50k lines and then seeing how much you can scale up to do at once).
and then:
append this to a new CSV of aggregated results. Then work through your whole dataset in chunks. At the end, you can parse your CSV of results and re-aggregate your counts to sum up and find the hashtag frequencies.
If the parallel computing idea is impossible or not useful in this case, how could I achieve what she describes with the help of the vroom package? Its role in this workflow isn't clear to me.
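This is how I currently picture the chunked approach, but I'm not sure it's what she meant ("my_data.csv", the text column, the output file name, and the chunk size of 50k rows are just placeholders, and the skip/n_max loop is my own guess):
library(vroom)
library(dplyr)
library(tidytext)

chunk_size <- 50000
n_rows <- nrow(vroom("my_data.csv", col_select = 1))  # count the data rows first
n_chunks <- ceiling(n_rows / chunk_size)
cols <- TRUE  # first chunk reads the header; later chunks reuse the column names

for (i in 0:(n_chunks - 1)) {
  chunk <- vroom("my_data.csv",
                 skip = i * chunk_size + (i > 0),  # skip the header after the first chunk
                 n_max = chunk_size,
                 col_names = cols)
  cols <- names(chunk)

  partial <- chunk %>%
    unnest_tokens(bigram, text, token = "ngrams", n = 2) %>%
    count(bigram)

  # append this chunk's counts to a CSV of aggregated results
  vroom_write(partial, "bigram_counts.csv", delim = ",", append = (i > 0))
}

# at the end, parse the CSV of results and re-aggregate the counts
bigram_totals <- vroom("bigram_counts.csv") %>%
  group_by(bigram) %>%
  summarise(n = sum(n)) %>%
  arrange(desc(n))
Is this roughly what she had in mind, or is there a cleaner way to process the file in chunks with vroom?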
Bear in mind that I would like to use the final data frame for several purposes, one of them being a small top-10 or top-20 plot (the 10 or 20 most common words/bigrams/trigrams after removing stop words).
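For instance, with the austen_bigrams object from the example above, this is roughly the plot I have in mind (I'm assuming the usual separate/filter approach for removing stop words is acceptable):
library(tidyr)
library(ggplot2)

austen_bigrams %>%
  filter(!is.na(bigram)) %>%                 # unnest_tokens() can return NA bigrams
  separate(bigram, into = c("word1", "word2"), sep = " ") %>%
  filter(!word1 %in% stop_words$word,        # drop bigrams containing stop words
         !word2 %in% stop_words$word) %>%
  unite(bigram, word1, word2, sep = " ") %>%
  slice_max(n, n = 20) %>%                   # keep the 20 most frequent bigrams
  ggplot(aes(n, reorder(bigram, n))) +
  geom_col() +
  labs(x = "count", y = NULL)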
For more information, here's my number of cores:
library(parallel)
detectCores()
[1] 8
detectCores(logical = FALSE)
[1] 4
Thanks! I'm quite new to big data and parallel computing in R, so any help would be greatly appreciated.