I have the following pyspark dataframe:

import pandas as pd
foo = pd.DataFrame({'id': ['a','a','a','a', 'b','b','b','b'],
                    'time': [1,2,3,4,1,2,3,4],
                    'col': ['1','2','1','2','3','2','3','2']})

foo_df = spark.createDataFrame(foo)
foo_df.show()

+---+----+---+
| id|time|col|
+---+----+---+
|  a|   1|  1|
|  a|   2|  2|
|  a|   3|  1|
|  a|   4|  2|
|  b|   1|  3|
|  b|   2|  2|
|  b|   3|  3|
|  b|   4|  2|
+---+----+---+

I would like to iterate over all ids and obtain a Python dictionary with the ids as keys and, as values, the list of col values for each id, like this:

foo_dict = {'a': ['1','2','1','2'], 'b': ['3','2','3','2']}

I have in total 10k ids and around 10m rows in foo, so I am looking for an efficient implementation.

Any ideas?


1 Answer

foo is a pandas dataframe, so you should check out the pandas documentation: the DataFrame object has built-in methods to help you iterate, slice, and dice your data. There is also this fun tool to help you visualize what is going on.

pandas has a ready-made method, to_dict, for converting a dataframe or series to a dict.
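
For the exact output shape you describe, here is a minimal sketch (assuming the data fits in driver memory; if you are starting from the Spark dataframe foo_df, you could pull it down with foo_df.toPandas() first):

import pandas as pd

foo = pd.DataFrame({'id': ['a','a','a','a', 'b','b','b','b'],
                    'time': [1,2,3,4,1,2,3,4],
                    'col': ['1','2','1','2','3','2','3','2']})

# Group by id and collect each group's col values into a list;
# groupby preserves the original row order within each group,
# so the lists stay in time order here.
foo_dict = foo.groupby('id')['col'].apply(list).to_dict()

print(foo_dict)
# {'a': ['1', '2', '1', '2'], 'b': ['3', '2', '3', '2']}

With 10m rows, the collect to the driver is likely to dominate the cost; one alternative worth benchmarking is doing the grouping on the Spark side with pyspark.sql.functions.collect_list and only collecting the aggregated result.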