
Imagine you have a DataFrame containing value observations of variables. Each observation is stored as a triple (Variable, Timestamp, Value). This layout is essentially an "observations DataFrame".

#Variable     Time                Value
#852-YF-007   2016-05-10 23:00:00 4
#852-YF-007   2016-05-11 04:00:00 4
#...
#852-YF-008   2016-05-10 23:00:00 5
#852-YF-008   2016-05-11 04:00:00 3
#...
#852-YF-009   2016-05-10 23:00:00 2
#852-YF-009   2016-05-11 04:00:00 9
#...

That data is loaded into a Spark DataFrame, and the timestamps are sampled so that there is one value per variable for each timestamp.

Question: How can I efficiently convert/transpose that into an "instants DataFrame" like this:

#Time                    852-YF-007     852-YF-008     852-YF-009
#2016-05-10 23:00:00     4              5              2
#2016-05-11 04:00:00     4              3              9
#...

The number of columns depends on the number of variables. Each column is a time series (all sampled values for that variable), while each row is a timestamp. Note: the number of timestamps will be much larger than the number of variables.

Update: It's related to pivot tables, but I do not have a fixed number of columns; the number varies with the number of variables.
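For illustration (not part of the original question), the reshaping itself can be sketched in plain Python using the sample values shown above. In Spark itself (assuming Spark ≥ 1.6), `df.groupBy("Time").pivot("Variable").agg(first("Value"))` performs the analogous operation, deriving the column set from the distinct variable values at runtime.

```python
from collections import defaultdict

# The (variable, time, value) triples from the example above.
observations = [
    ("852-YF-007", "2016-05-10 23:00:00", 4),
    ("852-YF-007", "2016-05-11 04:00:00", 4),
    ("852-YF-008", "2016-05-10 23:00:00", 5),
    ("852-YF-008", "2016-05-11 04:00:00", 3),
    ("852-YF-009", "2016-05-10 23:00:00", 2),
    ("852-YF-009", "2016-05-11 04:00:00", 9),
]

def to_instants(triples):
    """Pivot (variable, time, value) triples into one row per timestamp.

    The column set is derived from the distinct variable names, so it
    adapts to however many variables the data contains.
    """
    variables = sorted({var for var, _, _ in triples})
    rows = defaultdict(dict)
    for var, time, value in triples:
        rows[time][var] = value
    header = ["Time"] + variables
    body = [[time] + [rows[time].get(var) for var in variables]
            for time in sorted(rows)]
    return header, body

header, body = to_instants(observations)
# header → ['Time', '852-YF-007', '852-YF-008', '852-YF-009']
# body[0] → ['2016-05-10 23:00:00', 4, 5, 2]
```

This is only a sketch of the logic; for large data the work should stay inside Spark rather than be collected to the driver.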

Matthias
  • Possible duplicate of [Reshaping/Pivoting data in Spark RDD and/or Spark DataFrames](http://stackoverflow.com/questions/30260015/reshaping-pivoting-data-in-spark-rdd-and-or-spark-dataframes) –  Aug 30 '16 at 20:46
  • Hmm, not that sure because that example you mentioned has a fixed schema and fixed number of columns. Maybe someone can give a sample using spark-ts? – Matthias Aug 30 '16 at 22:03
