I am currently building a system that uses the Hadoop ecosystem on a cluster to store our data and make it easily accessible and queryable. I have a few questions concerning how to store the input data that we want to make accessible with Hive.
Our data looks like this:
- Sensor data from heterogeneous sensors
- Data is saved in time-based order.
- Data size: Several hundred GBs to TBs, each file is ~20-40 GB in size
- Current data format: proprietary
  - Binary files containing serialized C++ structs
  - Each struct has a timestamp and several different fields (some of them nested) per sensor message
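For illustration, parsing one of our records might look like the following minimal sketch (the field names and the record layout are made up here, just to show the general shape of the data):

```python
import struct

# Hypothetical fixed-size record layout: an int64 timestamp followed by
# sensor fields; our real records have ~200 fields, some of them nested.
RECORD_FORMAT = "<qif"  # little-endian: int64 timestamp, int32 sensor_id, float32 value
RECORD_SIZE = struct.calcsize(RECORD_FORMAT)

def parse_record(buf: bytes):
    """Unpack one binary record into a (timestamp, sensor_id, value) tuple."""
    return struct.unpack(RECORD_FORMAT, buf[:RECORD_SIZE])

# Pack a record the way a C++ writer might, then parse it back.
raw = struct.pack(RECORD_FORMAT, 1625097600, 42, 1.5)
print(parse_record(raw))  # (1625097600, 42, 1.5)
```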
I have written a data parser that can convert any of our data files to a new output format. We have thought about storing our data in CSV or JSON format on HDFS and then using it with Hive.
My questions:
Is CSV a better fit than JSON when using Hive?
We are concerned that JSON will perform worse, and the file size overhead of JSON (field names repeated in every record) should also be considered. Are there other benefits of CSV over JSON?
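To make the size concern concrete, here is a quick comparison of one hypothetical sensor message serialized both ways; with JSON Lines, every record carries all field names again, while CSV states them once in a header:

```python
import csv, io, json

# One hypothetical sensor message; our real messages have ~200 fields.
record = {"timestamp": 1625097600, "sensor": "rain_01", "intensity": 0.5}

# CSV: field names would appear once in a header, values only per row.
buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=record.keys())
writer.writerow(record)
csv_row = buf.getvalue()

# JSON Lines: every record repeats every field name.
json_row = json.dumps(record) + "\n"

print(len(csv_row), len(json_row))  # the JSON row is noticeably longer
```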
If we choose JSON, should we avoid nested objects in our files?
Because we have yet to define the structure of the JSON files, we can still decide against nested objects and instead keep all fields in the root object, prefixing each field name with the name of its former parent object. Example: a field named
nestedObjectName.nestedField1
instead of a real nested object. I am aware that Hive does not like dots in its field names.

If we choose JSON, which JSON SerDe should we use?
I've read that rcongiu's Hive-JSON SerDe might be the best one, but the blog post I found is quite old (07/2013) and things might have changed since then.
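To make the flattening idea described above concrete, our parser could rewrite nested objects like this (field names are made up; I would join key paths with underscores rather than dots, since Hive dislikes dots in field names):

```python
def flatten(obj, parent="", sep="_"):
    """Recursively merge nested dicts into one flat dict,
    joining key paths with `sep` (underscore, to stay Hive-friendly)."""
    flat = {}
    for key, value in obj.items():
        name = f"{parent}{sep}{key}" if parent else key
        if isinstance(value, dict):
            flat.update(flatten(value, name, sep))
        else:
            flat[name] = value
    return flat

message = {"timestamp": 1625097600, "gps": {"lat": 48.1, "lon": 11.5}}
print(flatten(message))
# {'timestamp': 1625097600, 'gps_lat': 48.1, 'gps_lon': 11.5}
```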
If we choose CSV, should we create one big table or several smaller tables that offer different views of the data?
What I mean is: our C++ structs currently have about 200 unique field names. I don't think one big "compound" table is a good idea. Should we instead split our binary data files into several CSV files, each corresponding to one logical data group (i.e., all data of one sensor type, for example a rain intensity sensor)? One possible downside: this might make querying data across all sensor-type tables in a given time range more complicated.
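As a sketch of that trade-off: splitting by sensor type keeps each table narrow, but a time-range query over all sensors then has to touch every table and merge several time-sorted streams again (sensor names and values below are made up):

```python
from heapq import merge

# Hypothetical per-sensor-type "tables", each sorted by timestamp.
rain = [(100, "rain", 0.2), (300, "rain", 0.8)]
wind = [(150, "wind", 5.0), (250, "wind", 7.5)]

# Querying "all data between t=120 and t=280" must read every table
# and merge the results back into time order.
merged = merge(rain, wind)  # merges sorted streams by timestamp
result = [row for row in merged if 120 <= row[0] <= 280]
print(result)  # [(150, 'wind', 5.0), (250, 'wind', 7.5)]
```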
Some more information about our cluster setup: we run a Hortonworks HDP cluster with HDFS 2.7.3, Hive 1.2.1000, Spark 2.0.0, and Zeppelin Notebook 0.6.0. We plan to keep our stack updated.
If somebody thinks there is a better data format than CSV or JSON, please mention it as well. Ideally, the format should be reasonably storage-efficient, mature, and have APIs in several programming languages (at least C++, Java, and Python). We are open to input, as we can still decide which data format to use. Thanks!