
Note - I have referred to an existing answer; although the data gets un-nested, I still could not convert it into CSV file format.

I want to flatten data of different types by using the explode functionality. The dataset contains arrays and structs. I want to explode the data so that I can then convert it into CSV file format.

Introduction

The R code is written using the sparklyr package to display the schema. [Reproducible code and data are given below.]

Existing Result

root
|-- contributors : string
|-- created_at : string
|-- entities (struct)
|     |-- hashtags (array) : [string]
|     |-- media (array)
|     |     |-- additional_media_info (struct)
|     |     |       |-- description : string
|     |     |       |-- embeddable : boolean
|     |     |       |-- monetizable : boolean
|     |     |-- display_url : string
|     |     |-- id : long
|     |     |-- id_str : string
|     |-- urls (array)     
|-- extended_entities (struct)
|-- retweeted_status (struct)
|-- user (struct)

I want to flatten this structure as below,

Expected Result

root
|-- contributors : string
|-- created_at : string
|-- entities (struct)
|-- entities.hashtags (array) : [string]
|-- entities.media (array)
|-- entities.media.additional_media_info (struct)
|-- entities.media.additional_media_info.description : string
|-- entities.media.additional_media_info.embeddable : boolean
|-- entities.media.additional_media_info.monetizable : boolean
|-- entities.media.display_url : string
|-- entities.media.id : long
|-- entities.media.id_str : string
|-- entities.urls (array)     
|-- extended_entities (struct)
|-- retweeted_status (struct)
|-- user (struct)

Database - Navigate to: Data - 0.5 MB. Then copy the numbered items to a text file named "example". Save it to a directory named "../example.json/" created in your working directory.

The R code to reproduce the example is given below.

Existing Code

library(sparklyr)
library(dplyr)
library(devtools)
devtools::install_github("mitre/sparklyr.nested")
# If Spark is not installed, then also need:
# spark_install(version = "2.2.0")
library(sparklyr.nested)
library(testthat)
library(jsonlite)

Sys.setenv(SPARK_HOME = "/usr/lib/spark")
conf <- spark_config()
conf$'sparklyr.shell.executor-memory' <- "20g"
conf$'sparklyr.shell.driver-memory' <- "20g"
conf$spark.executor.cores <- 16
conf$spark.executor.memory <- "20G"
conf$spark.yarn.am.cores  <- 16
conf$spark.yarn.am.memory <- "20G"
conf$spark.executor.instances <- 8
conf$spark.dynamicAllocation.enabled <- "false"
conf$maximizeResourceAllocation <- "true"
conf$spark.default.parallelism <- 32

# Connect to Spark
sc <- spark_connect(master = "local", config = conf, version = "2.2.0")

# Read the JSON data (note: `header` is a CSV option, so it is dropped here)
sample_tbl <- spark_read_json(sc, name = "example", path = "example.json",
                              memory = FALSE, overwrite = TRUE)

# View the schema
sdf_schema_viewer(sample_tbl)

I want to flatten data of different types by using the explode functionality. Please don't suggest another package, because my dataset of about 1 billion records is not readable with anything other than the sparklyr package; only sparklyr reads this huge dataset within a few minutes.

Goal - I then want to convert this exploded data into a proper CSV file format.

Shree

1 Answer


In short, your two requirements:

  • the expected schema, and
  • converting the un-nested data into a proper CSV file format

are not compatible. A CSV writer cannot handle complex types, including the arrays you clearly ask for in this and your previous questions.

To fully satisfy the second requirement you'll have to use one of the following:

  • Serialize arrays and structs, for example using JSON.
  • Fully explode the dataset, including the top-level array fields, to achieve a tidy format. This will, however, significantly increase the amount of data, since for each record you'll get length(hashtags) * length(media) * length(urls) rows.
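
A sketch of both options follows, using sdf_select / sdf_explode from sparklyr.nested and to_json, a Spark SQL function (Spark 2.1+) that sparklyr passes through untranslated. The leaf field names are assumed from the schema shown in the question and may differ in your actual data.

```r
library(sparklyr)
library(sparklyr.nested)
library(dplyr)

# Option 1: serialize the complex columns to JSON strings so that the
# CSV writer only ever sees atomic types.
json_tbl <- sample_tbl %>%
  mutate(entities_json = to_json(entities)) %>%
  select(contributors, created_at, entities_json)

spark_write_csv(json_tbl, path = "example_json_csv")

# Option 2: explode the arrays, then pick leaf fields with dotted
# paths. Every sdf_explode multiplies the row count by the array length.
flat_tbl <- sample_tbl %>%
  sdf_select(contributors, created_at,
             hashtags = entities.hashtags,
             media    = entities.media) %>%
  sdf_explode(hashtags, keep_all = TRUE) %>%  # one row per hashtag
  sdf_explode(media, keep_all = TRUE) %>%     # times one row per media item
  sdf_select(contributors, created_at, hashtags,
             description = media.additional_media_info.description,
             embeddable  = media.additional_media_info.embeddable,
             monetizable = media.additional_media_info.monetizable,
             display_url = media.display_url,
             media_id    = media.id_str)

spark_write_csv(flat_tbl, path = "example_flat_csv")
```

With keep_all = TRUE the explode keeps records whose arrays are empty or null (explode_outer semantics); with the default FALSE those records are dropped. Either way, be aware of the row multiplication described above before writing a billion-record dataset.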