Does anyone know how I can work with big data in R?

Question

Analyzing tweets in RStudio:

My csv file contains 4,000,000 tweets with five columns: screen_name, text, created_at, favorite_count, and retweet_count.

I am trying to find the frequency of hashtags using the following code. However, it runs for several days without finishing, and RStudio sometimes crashes.

library(dplyr)
library(tidytext)

mydata %>%
  unnest_tokens(word, text, token = "tweets") %>%
  anti_join(stop_words, by = "word")

I have tried other approaches to handling big data in R, such as https://rviews.rstudio.com/2019/07/17/3-big-data-strategies-for-r/ and the Spark text mining guide at https://spark.rstudio.com/guides/textmining/. None of them worked well for me.

In Spark, I do the following, but RStudio is unable to copy my dataset to Spark. RStudio shows "Spark is Running" for as long as a full day without the copy ever finishing.

Connect to your Spark cluster:

spark_conn <- spark_connect(master = "local")

Copy track_metadata to Spark:

track_metadata_tbl <- copy_to(spark_conn, my_database)

Do you have any suggestions/instructions/links that would help me analyze my data?

My laptop is a Mac. Processor: 2.9 GHz Dual-Core Intel Core i5. Memory: 8 GB 2133 MHz LPDDR3.

How is the memory usage? Go to `Activity Monitor` and report back. 8 GB is not much for doing anything: my 16 GB MacBook Pro is constantly at the edge of what it can handle even without doing data processing. Also: `sparkr` IS a good idea: you could ask another question on how to get that running — WestCoastProjects, Mar 29 '20 at 23:39
What about loading into a local MySQL or Postgres database and running aggregations on it? — hyprnick, Apr 28 '20 at 20:16

score 1 · Answer 1 · answered Apr 13 '20 at 04:05

If I were in your situation, I would not try to parse that whole file at once but instead work with a chunk at a time.

I would use vroom to read in the data, and work with chunks of the data at a time (starting with, say, 50k lines and then seeing how much you can scale up to do at once).

If you are interested in only counting hashtags, you can do something like:

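library(dplyr)
library(tidytext)
library(stringr)
# Tokenize, keep only tokens that start with "#", and count them
mydata %>%
  unnest_tokens(word, text, token = "tweets") %>%
  filter(str_detect(word, "^#")) %>%
  count(word, sort = TRUE)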

Append the result for each chunk to a new CSV of aggregated results, and work through your whole dataset this way. At the end, read that CSV of results back in and sum the counts per hashtag to get the overall hashtag frequencies.
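The chunk-at-a-time workflow described above can be sketched roughly as follows. This is only an illustration under several assumptions: it uses readr's read_csv_chunked() rather than vroom to do the chunking, it extracts hashtags with a simple regular expression instead of unnest_tokens(token = "tweets") (which, as far as I know, recent tidytext releases no longer support), and it keeps the small per-chunk counts in memory instead of appending them to a results CSV. The function name count_hashtags_chunked and the file name tweets.csv are made up for the example.

```r
library(readr)
library(dplyr)
library(stringr)

# Hypothetical helper: count hashtags in a large CSV one chunk at a time.
count_hashtags_chunked <- function(path, chunk_size = 50000) {
  # Runs once per chunk and returns that chunk's hashtag counts.
  count_chunk <- function(chunk, pos) {
    tags <- unlist(str_extract_all(str_to_lower(chunk$text), "#\\w+"))
    tibble(word = as.character(tags)) %>%
      filter(!is.na(word)) %>%
      count(word)
  }

  # Read only the text column so each chunk stays small in memory.
  per_chunk <- read_csv_chunked(
    path,
    callback = DataFrameCallback$new(count_chunk),
    chunk_size = chunk_size,
    col_types = cols_only(text = col_character())
  )

  # Re-aggregate: the same hashtag can appear in many chunks.
  per_chunk %>%
    group_by(word) %>%
    summarise(n = sum(n)) %>%
    arrange(desc(n))
}

# Example call (assumes a file named tweets.csv with a `text` column):
# hashtag_counts <- count_hashtags_chunked("tweets.csv", chunk_size = 50000)
```

Starting with a small chunk_size and increasing it while watching memory use follows the same "start at 50k lines and scale up" advice given above.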

score 0 · Answer 2 · edited Jun 20 '20 at 09:12

8 GB is really not much memory: please do look at Activity Monitor to see how much of it is in use.

Using sparklyr can be a very good idea. I suspect memory issues are causing the loading to fail. You will need to do some legwork to properly tune the local Spark instance. Here are some resources for getting sparklyr going:

https://github.com/sparklyr/sparklyr/issues/525

"Configuring correctly the executor's memory amount I was able to run copy_to with no problems."

Another one:

"Now it works for me. I didn't configured correctly the driver's memory. I increased it and now everything works perfectly."
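As a rough sketch of what increasing the driver's memory can look like in sparklyr: the sparklyr.shell.driver-memory setting is a documented sparklyr option, but the 4G value and the helper name local_spark_config are assumptions for illustration. On an 8 GB laptop the right value has to be found by experiment.

```r
library(sparklyr)

# Hypothetical helper: build a config for a local Spark connection
# with a larger driver heap than the default.
local_spark_config <- function(driver_memory = "4G") {
  config <- spark_config()
  # In local mode the work runs inside the driver process,
  # so the driver's memory is the setting that matters most.
  config$`sparklyr.shell.driver-memory` <- driver_memory
  config
}

# Example (requires a local Spark installation, e.g. via spark_install()):
# spark_conn <- spark_connect(master = "local", config = local_spark_config("4G"))
```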

Here's a note about an alternative to copy_to()

https://community.rstudio.com/t/sparklyr-s-error/12370

"copy_to() is currently not optimized and therefore, it is not recommended for copying medium nor large data sets. Instead, we recommend you copy the data into the cluster and then load the data in Spark using the family of spark_read_*() functions. For instance, by copying all the data as CSVs and then using spark_read_csv().

"That said, we are also looking into making improvements to copy_to() and collect() using Apache Arrow, you can track progress of this work with this pull request: github.com/rstudio/sparklyr/pull/1611."
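A minimal sketch of the spark_read_csv() route mentioned in the quote, assuming a sparklyr connection is already open. The helper name load_tweets, the table name tweets, and the file path are placeholders, and memory = FALSE is a choice made here to avoid caching the whole table on a machine with little RAM.

```r
library(sparklyr)
library(dplyr)

# Hypothetical helper: let Spark read the CSV itself instead of
# pushing an R data frame through copy_to().
load_tweets <- function(sc, path) {
  spark_read_csv(
    sc,
    name = "tweets",   # placeholder table name
    path = path,
    header = TRUE,
    memory = FALSE     # do not cache the full table in memory
  )
}

# Example (assumes `spark_conn` is an open connection and the file exists):
# tweets_tbl <- load_tweets(spark_conn, "tweets.csv")
# tweets_tbl %>% count()   # the row count is computed inside Spark
```

If the tweet text contains embedded newlines, Spark's CSV reader may need its multiLine option passed through the options argument. That has not been tested here.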
