
Imagine you are reading millions of data rows from a CSV file. Each line contains the sensor name, the current sensor value, and the timestamp at which that value was observed.

key, value, timestamp
temp_x, 8°C, 10:52am
temp_x, 25°C, 11:02am
temp_x, 30°C, 11:12am

This relates to a signal like this:

[figure: real-world signal observations]

So I wonder what's the best and most efficient way to store that data in Apache Hadoop HDFS. A first idea is to use BigTable, a.k.a. HBase. Here the signal name is the row key, while the values are kept in a column family that stores them over time. One could add further column families (for instance statistics) to the same row key.

[figure: stored in HBase]
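For illustration, a minimal sketch of that layout using the happybase client for HBase (the table name, column family name, and Thrift host are assumptions, not part of the question):

```python
# Hypothetical sketch of the HBase layout described above (all names are assumptions).
import happybase

connection = happybase.Connection('localhost')  # assumes an HBase Thrift server is running

# one column family 'v' that keeps many cell versions, i.e. the values over time
connection.create_table('sensor_values', {'v': dict(max_versions=1000000)})
table = connection.table('sensor_values')

# row key = signal name; the HBase cell timestamp carries the observation time
table.put(b'temp_x', {b'v:value': b'25'}, timestamp=1467266520000)

# read the value history back as (value, timestamp) pairs
history = table.cells(b'temp_x', b'v:value', versions=100, include_timestamp=True)
```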

Another idea is a tabular (SQL-like) structure. But then the key is replicated in every row, and statistics have to be calculated on demand and stored separately (here in a second table).

[figure: SQL-like storage]
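As a sketch of what the on-demand statistics for that tabular layout could look like in PySpark (column names follow the CSV header above; the unit is assumed to be stripped from the values already):

```python
# Sketch only: per-key statistics computed on demand and kept as a second table.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("sensor-table").getOrCreate()

rows = [("temp_x", 8.0, "10:52am"), ("temp_x", 25.0, "11:02am"), ("temp_x", 30.0, "11:12am")]
df = spark.createDataFrame(rows, ["key", "value", "timestamp"])

stats = df.groupBy("key").agg(F.min("value").alias("min"),
                              F.max("value").alias("max"),
                              F.avg("value").alias("mean"))
stats.show()
```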

I wonder if there is any better idea. Once stored, I want to read that data in Python/PySpark and do data analytics and machine learning. Therefore the data should be easily accessible using a schema (Spark RDD).
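For the schema requirement, a minimal sketch (the HDFS path is an assumption; the timestamp is kept as a string until its exact format is known):

```python
# Sketch: an explicit schema so the rows map cleanly onto a DataFrame (or its .rdd view).
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, DoubleType

spark = SparkSession.builder.appName("sensor-load").getOrCreate()

schema = StructType([
    StructField("key", StringType(), False),
    StructField("value", DoubleType(), True),       # assumes the unit ("°C") is stripped beforehand
    StructField("timestamp", StringType(), True),   # parse once the real timestamp format is known
])

df = spark.read.schema(schema).csv("hdfs:///data/sensors/*.csv", header=True)
rdd = df.rdd  # plain RDD view if needed
```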

Matthias

1 Answer


I would consider the following pipeline (a minimal PySpark sketch follows the list):

  • Load the data from the CSV file with the Databricks spark-csv package
  • Clean the data
  • Write it to Parquet files (to save space and time)

  • Load the data from the Parquet files
  • Analyse it
  • Perhaps save intermediate results
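A minimal sketch of that pipeline (paths and the cleaning step are assumptions; on Spark 1.x the CSV reader comes from the Databricks spark-csv package, on Spark 2.x+ it is built in):

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("sensor-pipeline").getOrCreate()

# 1) load the raw CSV
raw = spark.read.csv("hdfs:///data/sensors/raw/*.csv", header=True)

# 2) clean: strip the unit from the value column (real cleaning depends on the data)
clean = raw.withColumn("value", F.regexp_replace("value", "[^0-9.-]", "").cast("double"))

# 3) write to Parquet to save space and speed up later reads
clean.write.mode("overwrite").parquet("hdfs:///data/sensors/parquet")

# 4) load from Parquet and analyse
df = spark.read.parquet("hdfs:///data/sensors/parquet")
stats = df.groupBy("key").agg(F.avg("value").alias("mean"), F.count("value").alias("count"))

# 5) perhaps save intermediate results
stats.write.mode("overwrite").parquet("hdfs:///data/sensors/stats")
```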
Rockie Yang
  • Thanks. That's our current approach as well. – Matthias Jun 30 '16 at 05:51
  • Have you tried to save in Avro format to see the performance difference? – Rockie Yang Jun 30 '16 at 05:59
  • Yes, we tried that in several other projects, and it feels like Parquet is better in terms of performance. – Matthias Jun 30 '16 at 06:09
  • I think Parquet is suitable for most use cases, except when the data in the same column varies a lot and you almost always analyse all columns. Also see this good [SO answer](http://stackoverflow.com/questions/28957291/avro-v-s-parquet). – Rockie Yang Jun 30 '16 at 08:09
  • I have the same situation here and am not sure how to handle it. Is it reasonable to save the data as Parquet and load/analyse it with Spark SQL or the DataFrame/Dataset API? Since it is time-series data, shouldn't we store it in some kind of NoSQL database, given that we have frequent random access to the data? – Amin Mohebi May 10 '18 at 05:26