I have the following pyspark dataframe:

import pandas as pd
foo = pd.DataFrame({'id': ['a','a','a','a', 'b','b','b','b'],
                    'time': [1,2,3,4,1,2,3,4],
                    'col': ['1','2','1','2','3','2','3','2']})

foo_df = spark.createDataFrame(foo)
foo_df.show()

+---+----+---+
| id|time|col|
+---+----+---+
|  a|   1|  1|
|  a|   2|  2|
|  a|   3|  1|
|  a|   4|  2|
|  b|   1|  3|
|  b|   2|  2|
|  b|   3|  3|
|  b|   4|  2|
+---+----+---+

I would like to iterate over all ids and obtain a Python dictionary with the ids as keys and, as values, the list of col values for each id, like this:

foo_dict = {'a': ['1','2','1','2'], 'b': ['3','2','3','2']}

I have in total 10k ids and around 10m rows in foo, so I am looking for an efficient implementation.

Any ideas?


1 Answer

foo is a pandas dataframe, so you should check out the pandas documentation: the DataFrame object has built-in methods to help you iterate, slice, and dice your data. There is also this fun tool to help you visualize what is going on.

pandas has a ready-made method, to_dict, for converting a dataframe or series to a dict.
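
For the exact output shape you describe, here is a minimal sketch (assuming the data fits in driver memory; if you are starting from the Spark dataframe foo_df, you could pull it down with foo_df.toPandas() first):

import pandas as pd

foo = pd.DataFrame({'id': ['a','a','a','a', 'b','b','b','b'],
                    'time': [1,2,3,4,1,2,3,4],
                    'col': ['1','2','1','2','3','2','3','2']})

# Group by id and collect each group's col values into a list;
# groupby preserves the original row order within each group,
# so the lists stay in time order here.
foo_dict = foo.groupby('id')['col'].apply(list).to_dict()

print(foo_dict)
# {'a': ['1', '2', '1', '2'], 'b': ['3', '2', '3', '2']}

With 10m rows, the collect to the driver is likely to dominate the cost; one alternative worth benchmarking is doing the grouping on the Spark side with pyspark.sql.functions.collect_list and only collecting the aggregated result.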