I have a huge number of JSON files that I need to transform into Parquet. They look something like this:
{
  "foo": "bar",
  "props": {
    "prop1": "val1",
    "prop2": "val2"
  }
}
And I need to transform them into a Parquet file whose structure is this (nested properties are made top-level and get _ as a prefix):
foo=bar
_prop1=val1
_prop2=val2
Now here's the catch: not all of the JSON documents have the same properties. So, if doc1 has prop1 and prop2, but doc2 has prop3, the final Parquet file must contain all three properties (some of them will be null for some of the records).
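To make that concrete, a second document might look like this (the values here are made up for illustration):

{
  "foo": "baz",
  "props": {
    "prop3": "val3"
  }
}

and the rows in the merged Parquet file would then be:

foo=bar  _prop1=val1  _prop2=val2  _prop3=null
foo=baz  _prop1=null  _prop2=null  _prop3=val3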
I understand that Parquet needs a schema up front, so my current plan is:
- Traverse all the JSON files
- Infer a schema per document (using Kite, like this)
- Merge all the schemas
- Start writing the Parquet file

This approach strikes me as very complicated, slow and error-prone. I'm wondering if there's a better way to achieve this using Spark.
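For reference, here's a rough, untested sketch (in Scala) of what I'm hoping Spark can do. The paths are placeholders, and I'm assuming that Spark's JSON reader unions the fields it sees across the files into a single inferred schema:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

val spark = SparkSession.builder().appName("json-to-parquet").getOrCreate()

// Read every JSON document at once; the inferred schema should cover
// all fields encountered across the files (my assumption).
val df = spark.read.json("/path/to/json/dir")

// Promote each field under "props" to a top-level column named "_<prop>",
// and keep the remaining top-level columns as they are.
val propCols = df.select("props.*").columns
  .map(name => col(s"props.$name").alias(s"_$name"))
val otherCols = df.columns.filterNot(_ == "props").map(col)

val flat = df.select(otherCols ++ propCols: _*)

// Documents that lack a given prop would simply get null in that column.
flat.write.parquet("/path/to/output")

(I haven't verified how well the schema inference step behaves on a huge number of files.)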