
Introduction

I have written the following R code by referring to Link-1. The sparklyr package is used here in R to read large data from a JSON file, but creating the CSV file produces an error.

R code

library(sparklyr)
library(sparklyr.nested)  # provides sdf_schema_viewer()

conf <- spark_config()    # conf is not shown in the question; a default config is assumed
sc <- spark_connect(master = "local", config = conf, version = '2.2.0')
sample_tbl <- spark_read_json(sc, name = "example", path = "example.json",
                              memory = FALSE, overwrite = TRUE)  # spark_read_json() has no header argument
sdf_schema_viewer(sample_tbl)                      # to view the schema
sample_tbl %>% spark_write_csv(path = "data.csv")  # to write the CSV file

The last line produces the error shown below. The dataset contains different data types and has nested columns; I can show the schema if required.

Error

Error: java.lang.UnsupportedOperationException: CSV data source does not support struct<media:array<struct<display_url:string,expanded_url:string,id:bigint,id_str:string,indices:array<...>,media......

Question

How can I resolve this error? Is it caused by the mixed data types or by the columns nested two to three levels deep? Any help would be appreciated.


1 Answer


It seems that your dataframe has array-typed columns, which are NOT supported by the CSV data source. A CSV file cannot represent arrays or other nested structures.

Therefore, if you want your data in a human-readable form, write it out as an Excel file instead.
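For illustration, a minimal sketch of the Excel route, assuming the data fits in driver memory and using the CRAN package writexl (an assumption on my part; any xlsx writer would do). Note that nested list-columns would still need to be flattened or serialized to strings first, as sketched further below:

library(dplyr)
library(writexl)  # writexl is an assumption, not part of sparklyr

# Collect the Spark dataframe to the driver as a local tibble,
# then write it out as an .xlsx workbook.
sample_tbl %>%
  collect() %>%
  write_xlsx("data.xlsx")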

Please note that Excel's CSV dialect (a very special case) does support embedded "\n" newlines inside quoted fields, but you then have to use "\r\n" (the Windows EOL) as the end-of-line for each row.
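If a CSV is still required, one workaround is to serialize the nested columns to JSON strings before writing. A minimal sketch, assuming a Spark version whose to_json SQL function supports the column's type (structs since Spark 2.1, arrays of structs in later 2.x releases); the column name entities is hypothetical, chosen to match the struct shown in the error message:

library(sparklyr)
library(dplyr)

# to_json() is not an R function; sparklyr's dplyr backend passes it
# through to Spark SQL, which serializes a struct/array column into a
# single JSON string -- an atomic type that CSV can hold.
serialized_tbl <- sample_tbl %>%
  mutate(entities = to_json(entities))  # "entities" is a hypothetical column name

spark_write_csv(serialized_tbl, path = "data.csv")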

  • It is also worth pointing out that there is more here than just arrays. The OP's data (https://stackoverflow.com/q/52194942/6910411, https://stackoverflow.com/q/52263836/6910411) contains a deeply nested structure, which really has no CSV equivalent. – zero323 Sep 11 '18 at 11:07
  • @rani The other question is still slightly unclear, but as far as I understand it, it is not enough. To write to CSV you can use only atomic types (strings, integers, decimals, doubles, floats, Booleans); no `structs` or `arrays` are allowed. This means you'll have to either reshape the data with some combination of explode and nested accessors (possibly writing Scala extensions) or serialize the fields (see the sketch below). Schema alone is ambiguous, so you should really provide example input and expected output. – zero323 Sep 11 '18 at 14:05
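For reference, the reshaping described in the comment above can be sketched with the sparklyr.nested package (the same package that provides the question's sdf_schema_viewer() call). The field names below (media and its subfields) are assumptions for illustration, not the OP's actual schema:

library(sparklyr)
library(sparklyr.nested)
library(dplyr)

# Explode the array so each element becomes its own row, then pull
# atomic leaf fields out of the structs with nested accessors.
flattened_tbl <- sample_tbl %>%
  sdf_explode(media) %>%                 # "media" is a hypothetical array column
  sdf_select(id = media.id,              # nested accessors; field names are
             url = media.display_url)    # assumptions for illustration

spark_write_csv(flattened_tbl, path = "data_flat.csv")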