I am about to do some signal analysis with Hadoop/Spark and I need help structuring the whole process.
Signals are currently stored in a database; we will read them with Sqoop and transform them into files on HDFS, with a schema similar to:
<Measure ID> <Source ID> <Measure timestamp> <Signal values>
where the signal values are just a string of comma-separated floating point numbers.
000123 S001 2015/04/22T10:00:00.000Z 0.0,1.0,200.0,30.0 ... 100.0
000124 S001 2015/04/22T10:05:23.245Z 0.0,4.0,250.0,35.0 ... 10.0
...
000126 S003 2015/04/22T16:00:00.034Z 0.0,0.0,200.0,00.0 ... 600.0
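For context, here is how I currently imagine parsing those lines with Spark (Scala) and persisting them as Parquet, with the samples stored as an array<double> rather than a string; the paths and column names are just placeholders:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

// Samples are stored as array<double> instead of a comma-separated string.
case class Signal(measureId: String, sourceId: String, timestamp: String, values: Seq[Double])

object SignalsToParquet {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("SignalsToParquet"))
    val sqlContext = new SQLContext(sc)
    import sqlContext.implicits._

    val signals = sc
      .textFile("hdfs:///data/raw/signals") // placeholder path
      .map { line =>
        // Whitespace-separated: measure id, source id, timestamp, then the value string
        val Array(measureId, sourceId, ts, valueStr) = line.split("\\s+", 4)
        Signal(measureId, sourceId, ts, valueStr.split(",").map(_.toDouble).toSeq)
      }
      .toDF()

    signals.write.parquet("hdfs:///data/parquet/signals") // placeholder path
  }
}

My understanding is that Parquet keeps this nested array schema, so Hive could later query the same files (I am less sure about Impala's support for array columns), but I do not know if this layout is the right one.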
We would like to write interactive/batch queries to:
apply aggregation functions over signal values
SELECT *
FROM SIGNALS
WHERE MAX(VALUES) > 1000.0
To select signals that had a peak over 1000.0.
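If the samples are stored as an array<double> as above, I imagine expressing this with a small UDF in Spark (Scala); `signals`, the column names, and the threshold are placeholders:

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions._

// Per-signal peak detection: MAX over the value array, done with a UDF.
// Assumes `signals` has a values: array<double> column with non-empty arrays.
val arrayMax = udf { (xs: Seq[Double]) => xs.max }

def signalsWithPeakOver(signals: DataFrame, threshold: Double): DataFrame =
  signals.where(arrayMax(col("values")) > threshold)

// e.g. signalsWithPeakOver(signals, 1000.0) for the query above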
apply aggregations over aggregations
SELECT SOURCEID, MAX(VALUES)
FROM SIGNALS
GROUP BY SOURCEID
HAVING MAX(MAX(VALUES)) > 1500.0
To select sources that have at least one signal exceeding 1500.0.
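Since SQL cannot nest aggregates like MAX(MAX(...)), I imagine this as a two-step computation, sketched here in Spark (Scala) with the same assumed array<double> column:

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions._

// Two steps: per-signal max via a UDF, then MAX over those per source.
val arrayMax = udf { (xs: Seq[Double]) => xs.max } // assumes non-empty arrays

def sourcesWithPeakOver(signals: DataFrame, threshold: Double): DataFrame =
  signals
    .select(col("sourceId"), arrayMax(col("values")).as("signalMax"))
    .groupBy("sourceId")
    .agg(max("signalMax").as("sourceMax")) // aggregation over the per-signal aggregates
    .where(col("sourceMax") > threshold)

// e.g. sourcesWithPeakOver(signals, 1500.0) for the query above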
apply user-defined functions over samples
SELECT *
FROM SIGNALS
WHERE MAX(LOW_BAND_FILTER('5.0 kHz', VALUES)) > 100.0
To select signals that, after being low-band filtered at 5.0 kHz, have at least one value over 100.0.
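For the user-defined part I imagine something like the following Spark (Scala) sketch; the moving average is only a stand-in for the real filter, which we would still have to implement:

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions._

// The moving average below is only a stand-in: it is NOT a real 5.0 kHz
// low-pass filter, just a placeholder showing where the DSP code would go.
val lowBandFilter = udf { (xs: Seq[Double]) =>
  if (xs.length < 3) xs else xs.sliding(3).map(w => w.sum / w.size).toSeq
}
val arrayMax = udf { (xs: Seq[Double]) => xs.max } // assumes non-empty arrays

def filteredSignalsOver(signals: DataFrame, threshold: Double): DataFrame =
  signals.where(arrayMax(lowBandFilter(col("values"))) > threshold)

// e.g. filteredSignalsOver(signals, 100.0) for the query above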
We need some help to:
- find the right file format for writing the signal data to HDFS. I was thinking of Apache Parquet (as in the sketch above). How would you structure the data?
- understand the proper approach to data analysis: is it better to create different datasets (e.g. process the data with Spark and persist the results on HDFS) or to try to do everything at query time from the original dataset?
- determine whether Hive is a good tool for queries such as the ones I wrote. We are running on Cloudera Enterprise Hadoop, so we can also use Impala.
- in case we produce different derived datasets from the original one, figure out how we can keep track of data lineage, i.e. know how the data was generated from the original version (I sketch one idea below).
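On the lineage point, the only lightweight idea I have so far is to write a small metadata sidecar next to every derived dataset; the names, paths, and JSON layout below are just placeholders:

import java.nio.charset.StandardCharsets
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

// Writes a `_LINEAGE.json` file next to a derived dataset, recording where it
// came from and which job produced it. Field names are placeholders.
def writeLineage(outputDir: String, sourcePath: String, jobName: String): Unit = {
  val json =
    s"""{"source": "$sourcePath", "job": "$jobName", "createdAt": "${java.time.Instant.now()}"}"""
  val fs = FileSystem.get(new Configuration())
  val out = fs.create(new Path(outputDir, "_LINEAGE.json"))
  try out.write(json.getBytes(StandardCharsets.UTF_8)) finally out.close()
}

// e.g. writeLineage("hdfs:///data/parquet/signals_max", "hdfs:///data/parquet/signals", "peak-extraction")

I do not know whether this is good practice or whether there is standard tooling for this on our stack.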
Thank you very much!