I have some data that looks like this:
from local_spark import sc, sqlContext
rdd = sc.parallelize([
    ("key1", 'starttime=10/01/2015', 'param1', '1,2,3,99,88'),
    ("key2", 'starttime=10/02/2015', 'param1', '11,12'),
    ("key1", 'starttime=10/01/2015', 'param2', '66,77')
])
The last field is a comma-separated list of values (one value per second) that can be very large.
What I need to do is group the dataset by key and then flatMap it. The expected result would be something like this:
(key1) # rdd key
# for each key, a table with the values
key    timestamp    param1   param2
key1   10/01/2015   1        66
key1   10/01/2015   2        77
key1   10/01/2015   3        null
key1   10/01/2015   99       null
key1   10/01/2015   88       null
(key2) # rdd key
key    timestamp    param1   param2
key2   10/02/2015   11       null
key2   10/02/2015   12       null
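Put another way, as plain Python data the result I am after would be roughly this (my own sketch of the structure; None marks positions where a param has no value):

# Target structure: one (key, rows) pair per RDD key; within each row, param
# values are paired up by their position in the comma-separated lists.
expected = [
    ("key1", [("key1", "10/01/2015", "1",  "66"),
              ("key1", "10/01/2015", "2",  "77"),
              ("key1", "10/01/2015", "3",  None),
              ("key1", "10/01/2015", "99", None),
              ("key1", "10/01/2015", "88", None)]),
    ("key2", [("key2", "10/02/2015", "11", None),
              ("key2", "10/02/2015", "12", None)]),
]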
So far, what I have tried is something like this: rdd.groupByKey().flatMap(functionToParseValuesAndTimeStamps)
If I do something like this, would the results of the flatMap operation still be grouped by key, or would I "lose" the group-by?
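To make it concrete, this is roughly what I have in mind (only a sketch, not tested): functionToParseValuesAndTimeStamps below is just a placeholder for the parsing logic, I key the records first because groupByKey() only works on (key, value) pairs, and I assume each key has a single starttime, as in the sample above.

# Sketch only: functionToParseValuesAndTimeStamps is a placeholder, and I assume
# one starttime per key (as in the sample data above).
def functionToParseValuesAndTimeStamps(key_and_records):
    key, records = key_and_records            # records: iterable of (starttime, param, values)
    by_param = {}
    timestamp = None
    for starttime, param, values in records:
        timestamp = starttime.split('=')[1]   # 'starttime=10/01/2015' -> '10/01/2015'
        by_param[param] = values.split(',')
    longest = max(len(v) for v in by_param.values())
    rows = []
    for i in range(longest):
        row = [key, timestamp]
        for param in sorted(by_param):        # param1, param2, ...
            vals = by_param[param]
            row.append(vals[i] if i < len(vals) else None)
        rows.append(tuple(row))
    return rows

# groupByKey() needs (key, value) pairs, so key the records first
pairs = rdd.map(lambda rec: (rec[0], rec[1:]))
result = pairs.groupByKey().flatMap(functionToParseValuesAndTimeStamps)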
Note: a more naive approach would be to flatMap first and then group by key. But since there are far fewer keys than individual values, I think this would perform poorly.
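For reference, the naive version I want to avoid would be something along these lines (again only a sketch; the explode helper is hypothetical, and turning param1/param2 into columns would still need extra work after the grouping):

# Naive sketch: explode every single value into its own record, then group.
# The shuffle then moves one record per value, which is what worries me.
def explode(rec):
    key, starttime, param, values = rec
    timestamp = starttime.split('=')[1]
    return [(key, (timestamp, param, i, v)) for i, v in enumerate(values.split(','))]

exploded = rdd.flatMap(explode)    # one element per value in the comma-separated lists
grouped = exploded.groupByKey()    # shuffle size proportional to the number of values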