I have data/rows of multiple key/value pairs with an unknown number of keys -- some overlapping and some not -- that I would like to create a Spark DataFrame from. My ultimate goal is to write CSV from this DataFrame.
I have flexibility with the input data/rows: most readily they are JSON strings, but they could be converted. Their keys potentially overlap:
{"color":"red", "animal":"fish"}
{"color":"green", "animal":"panda"}
{"color":"red", "animal":"panda", "fruit":"watermelon"}
{"animal":"aardvark"}
{"color":"blue", "fruit":"apple"}
Ideally, I would like to create a DataFrame that looks like this from this data:
-----------------------------
color | animal | fruit
-----------------------------
red | fish | null
green | panda | null
red | panda | watermelon
null | aardvark | null
blue | null | apple
-----------------------------
Of note, data/rows without a particular key show null for that column, and all keys from the data/rows are represented as columns.
I feel relatively comfortable with many of the Spark basics, but I'm having trouble envisioning a process for efficiently taking my RDD/DataFrame of key/value pairs -- with an unknown number of columns and keys -- and creating a DataFrame with those keys as columns.
By efficient, I mean that I would like to avoid, if possible, creating an object that holds all input rows in memory at once (e.g. a single dictionary).
Again, the final goal is writing CSV; I'm assuming creating a DataFrame is a logical intermediate step to that end.
Another wrinkle:
Some of the data will be multivalued, something like:
{"color":"pink", "animal":["fish","mustang"]}
{"color":["orange","purple"], "animal":"panda"}
Using a provided delimiter, e.g. / to avoid collision with the , that delimits columns, I would like to join these values within each column in the output, e.g.:
------------------------------------
color | animal | fruit
------------------------------------
pink | fish/mustang | null
orange/purple | panda | null
------------------------------------
Once there is an approach for the primary question, I'm confident I can work this part out, but I'm throwing it out there anyhow since it will be a dimension of the problem.