
The object in the S3 bucket is 5.3 GB in size. To read the object into R, I used `get_object("link to bucket path")`, but this leads to memory issues.

So, I installed Spark 2.3.0 with RStudio and am trying to load this object directly into Spark, but I don't know the command to do so. So far I have:

```r
library(sparklyr)
library(dplyr)
sc <- spark_connect(master = "local")
```

If I could convert the object into a readable data type (such as a data.frame/tbl in R), I would use `copy_to` to transfer the data into Spark from R, as below:

```r
# Copy data to Spark
spark_tbl <- copy_to(sc, data)
```

How can I read the object directly into Spark instead?

Relevant links:

  1. https://github.com/cloudyr/aws.s3/issues/170

  2. Sparklyr connection to S3 bucket throwing up error

Any guidance would be sincerely appreciated.

Abhishek
  • What do you mean by convert? – MLavoie Jul 30 '18 at 09:55
  • convert the object type to data, so that I can use it as a data.frame or tibble in R. Currently, the object is not readable within R unless it is converted into some kind of data.frame – Abhishek Jul 30 '18 at 10:17
  • `get_object` returns a `list` if I am not mistaken. We have no way to tell what it looks like or what output you expect. Please post a [mcve]. – zero323 Jul 31 '18 at 23:22

1 Answer


Solution.

I was trying to read a 5.3 GB CSV file from the S3 bucket. Because R loads the entire file into memory in a single process, this was causing memory issues (I/O exceptions).

The solution is to load sparklyr in R (`library(sparklyr)`) and let Spark handle the read, so that all the cores on the machine are utilized instead of a single R process.

`get_object("link to bucket path")` can be replaced by `spark_read_csv(sc, name = "data", path = "link to bucket path")`. Since the file is read into Spark rather than into R's memory, the memory issues disappear.
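As a minimal sketch of the connection plus read (the bucket path, table name, and the `hadoop-aws` package version are hypothetical placeholders; the `s3a://` scheme assumes S3 support is configured for your Spark installation):

```r
library(sparklyr)

# Assumption: Spark 2.3.x built against Hadoop 2.7, so hadoop-aws 2.7.x
# is pulled in to provide the s3a:// filesystem. Adjust to your build.
conf <- spark_config()
conf$sparklyr.defaultPackages <- c("org.apache.hadoop:hadoop-aws:2.7.3")

sc <- spark_connect(master = "local", config = conf)

# Read the CSV straight into Spark. memory = FALSE avoids caching the
# whole 5.3 GB table in memory up front; rows are read lazily instead.
spark_tbl <- spark_read_csv(
  sc,
  name   = "big_csv",                           # name of the Spark table
  path   = "s3a://my-bucket/path/to/file.csv",  # hypothetical S3 path
  memory = FALSE
)
```

`spark_tbl` is then a regular `tbl_spark`, so dplyr verbs run inside Spark and only the results you `collect()` come back into R.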

Also, depending on the file format, you can switch to the matching read/write function: `spark_load_table`, `spark_read_jdbc`, `spark_read_json`, `spark_read_libsvm`, `spark_read_parquet`, `spark_read_source`, `spark_read_table`, `spark_read_text`, `spark_save_table`, `spark_write_csv`, `spark_write_jdbc`, `spark_write_json`, `spark_write_parquet`, `spark_write_source`, `spark_write_table`, `spark_write_text`.

Abhishek