
Imagine you are reading millions of data rows from a CSV file. Each line contains the sensor name, the current sensor value, and the timestamp at which that value was observed.

key, value, timestamp
temp_x, 8°C, 10:52am
temp_x, 25°C, 11:02am
temp_x, 30°C, 11:12am

This relates to a signal like this:

[figure: real-world signal observations]

So I wonder what's the best and most efficient way to store that data in Apache Hadoop HDFS. A first idea is to use BigTable, a.k.a. HBase. Here the signal name is the row key, while the values are kept in a column family that stores them over time. One could add further column families (for instance statistics) to the same row key.

[figure: stored in HBase]
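For illustration, a minimal sketch of that layout using the happybase client for HBase (the table name, column family name, and Thrift host are assumptions, not part of the question):

```python
# Hypothetical sketch of the HBase layout described above (all names are assumptions).
import happybase

connection = happybase.Connection('localhost')  # assumes an HBase Thrift server is running

# one column family 'v' that keeps many cell versions, i.e. the values over time
connection.create_table('sensor_values', {'v': dict(max_versions=1000000)})
table = connection.table('sensor_values')

# row key = signal name; the HBase cell timestamp carries the observation time
table.put(b'temp_x', {b'v:value': b'25'}, timestamp=1467266520000)

# read the value history back as (value, timestamp) pairs
history = table.cells(b'temp_x', b'v:value', versions=100, include_timestamp=True)
```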

Another idea is a tabular (SQL-like) structure. But then the key is replicated in every row, and statistics have to be calculated on demand and stored separately (here in a second table).

[figure: SQL-like storage]
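As a sketch of what the on-demand statistics for that tabular layout could look like in PySpark (column names follow the CSV header above; the unit is assumed to be stripped from the values already):

```python
# Sketch only: per-key statistics computed on demand and kept as a second table.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("sensor-table").getOrCreate()

rows = [("temp_x", 8.0, "10:52am"), ("temp_x", 25.0, "11:02am"), ("temp_x", 30.0, "11:12am")]
df = spark.createDataFrame(rows, ["key", "value", "timestamp"])

stats = df.groupBy("key").agg(F.min("value").alias("min"),
                              F.max("value").alias("max"),
                              F.avg("value").alias("mean"))
stats.show()
```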

I wonder if there is any better idea. Once stored, I want to read that data in Python/PySpark and do data analytics and machine learning. Therefore the data should be easily accessible using a schema (Spark RDD).
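For the schema requirement, a minimal sketch (the HDFS path is an assumption; the timestamp is kept as a string until its exact format is known):

```python
# Sketch: an explicit schema so the rows map cleanly onto a DataFrame (or its .rdd view).
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, DoubleType

spark = SparkSession.builder.appName("sensor-load").getOrCreate()

schema = StructType([
    StructField("key", StringType(), False),
    StructField("value", DoubleType(), True),       # assumes the unit ("°C") is stripped beforehand
    StructField("timestamp", StringType(), True),   # parse once the real timestamp format is known
])

df = spark.read.schema(schema).csv("hdfs:///data/sensors/*.csv", header=True)
rdd = df.rdd  # plain RDD view if needed
```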

Matthias

1 Answer


I would consider the following pipeline (a minimal PySpark sketch follows the list):

  • Load the data from the CSV file with the Databricks spark-csv package
  • Clean the data
  • Write it to Parquet files (to save space and time)

  • Load the data from the Parquet files
  • Analyse it
  • Perhaps save intermediate results
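A minimal sketch of that pipeline (paths and the cleaning step are assumptions; on Spark 1.x the CSV reader comes from the Databricks spark-csv package, on Spark 2.x+ it is built in):

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("sensor-pipeline").getOrCreate()

# 1) load the raw CSV
raw = spark.read.csv("hdfs:///data/sensors/raw/*.csv", header=True)

# 2) clean: strip the unit from the value column (real cleaning depends on the data)
clean = raw.withColumn("value", F.regexp_replace("value", "[^0-9.-]", "").cast("double"))

# 3) write to Parquet to save space and speed up later reads
clean.write.mode("overwrite").parquet("hdfs:///data/sensors/parquet")

# 4) load from Parquet and analyse
df = spark.read.parquet("hdfs:///data/sensors/parquet")
stats = df.groupBy("key").agg(F.avg("value").alias("mean"), F.count("value").alias("count"))

# 5) perhaps save intermediate results
stats.write.mode("overwrite").parquet("hdfs:///data/sensors/stats")
```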
Rockie Yang
  • Thanks. That's our current approach as well. – Matthias Jun 30 '16 at 05:51
  • Have you tried to save in Avro format to see the performance difference? – Rockie Yang Jun 30 '16 at 05:59
  • Yes, we tried that in several other projects, and it feels like Parquet is better in terms of performance. – Matthias Jun 30 '16 at 06:09
  • I think Parquet is suitable for most use cases, except when the data in the same column varies a lot and you almost always analyse all columns. Also see this good [SO answer](http://stackoverflow.com/questions/28957291/avro-v-s-parquet). – Rockie Yang Jun 30 '16 at 08:09
  • I have the same situation here and am not sure how to handle it. Is it reasonable to save the data as Parquet and load/analyse it with Spark SQL or the DataFrame/Dataset API? Since it is time-series data, shouldn't we store it in some kind of NoSQL database, given that we have frequent random access to the data? – Amin Mohebi May 10 '18 at 05:26