I have the following PySpark DataFrame:
import pandas as pd

foo = pd.DataFrame({'id':   ['a', 'a', 'a', 'a', 'b', 'b', 'b', 'b'],
                    'time': [1, 2, 3, 4, 1, 2, 3, 4],
                    'col':  ['1', '2', '1', '2', '3', '2', '3', '2']})
foo_df = spark.createDataFrame(foo)
foo_df.show()
+---+----+---+
| id|time|col|
+---+----+---+
| a| 1| 1|
| a| 2| 2|
| a| 3| 1|
| a| 4| 2|
| b| 1| 3|
| b| 2| 2|
| b| 3| 3|
| b| 4| 2|
+---+----+---+
I would like to iterate over all ids and obtain a Python dictionary whose keys are the ids and whose values are the lists of col values (in time order), like this:

foo_dict = {'a': ['1','2','1','2'], 'b': ['3','2','3','2']}
I have 10k ids in total and around 10M rows in foo, so I am looking for an efficient implementation. Any ideas?
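
For reference, here is a minimal sketch of the direction I have been considering, assuming Spark 2.4+ (for sort_array over arrays of structs) and the spark session from above: aggregate per id on the executors, so that only the ~10k aggregated rows ever come back to the driver. The struct/sort_array step is there only because collect_list on its own does not guarantee time order:

from pyspark.sql import functions as F

# Collect (time, col) pairs per id, sort them by time, then keep only col.
agg = (foo_df
       .groupBy('id')
       .agg(F.sort_array(F.collect_list(F.struct('time', 'col'))).alias('pairs'))
       .select('id', F.col('pairs.col').alias('cols')))

# Only the ~10k aggregated rows are pulled back to the driver.
foo_dict = {row['id']: row['cols'] for row in agg.collect()}
# {'a': ['1', '2', '1', '2'], 'b': ['3', '2', '3', '2']}

Is something along these lines reasonable at 10M rows, or is there a faster pattern?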