I have a PySpark DataFrame with origin, destination, date (year-month), and a list of JSON objects for each date-origin-destination combination:

+---------+--------------+----------+--------------------+
|fs_origin|fs_destination|year-month|                JSON|
+---------+--------------+----------+--------------------+
|      TLV|           AUH|   2022-06|[{"fs_date":"2022...|
|      TLV|           AUH|   2022-07|[{"fs_date":"2022...|
|      TLV|           AUH|   2022-08|[{"fs_date":"2022...|
|      TLV|           AUH|   2022-09|[{"fs_date":"2022...|
|      TLV|           AUH|   2022-10|[{"fs_date":"2022...|
|      TLV|           AUH|   2022-11|[{"fs_date":"2022...|
|      TLV|           BAK|   2022-06|[{"fs_date":"2022...|
|      TLV|           BAK|   2022-07|[{"fs_date":"2022...|
|      TLV|           BAK|   2022-08|[{"fs_date":"2022...|
|      TLV|           BAK|   2022-09|[{"fs_date":"2022...|
|      TLV|           BAK|   2022-10|[{"fs_date":"2022...|
|      TLV|           BAK|   2022-11|[{"fs_date":"2022...|
|      TLV|           BER|   2022-06|[{"fs_date":"2022...|
|      TLV|           BER|   2022-07|[{"fs_date":"2022...|
|      TLV|           BER|   2022-08|[{"fs_date":"2022...|
|      TLV|           BER|   2022-09|[{"fs_date":"2022...|
|      TLV|           BER|   2022-10|[{"fs_date":"2022...|
|      TLV|           BER|   2022-11|[{"fs_date":"2022...|
+---------+--------------+----------+--------------------+

I want to turn it into a nested Python dictionary that holds the 'JSON' value keyed first by origin-destination and then by year-month, something like this:

{
   "TLV-AUH": {
                 "2022-06": [{"fs_date":"2022...],
                 "2022-07": [{"fs_date":"2022...],
                 "2022-08": [{"fs_date":"2022...],
                 "2022-09": [{"fs_date":"2022...],
                 "2022-10": [{"fs_date":"2022...],
                 "2022-11": [{"fs_date":"2022...]
              },
   "TLV-BAK": {
                 "2022-06": [{"fs_date":"2022...],
                 "2022-07": [{"fs_date":"2022...],
                 "2022-08": [{"fs_date":"2022...],
                 "2022-09": [{"fs_date":"2022...],
                 "2022-10": [{"fs_date":"2022...],
                 "2022-11": [{"fs_date":"2022...]
              },
   "TLV-BER": {
                 "2022-06": [{"fs_date":"2022...],
                 "2022-07": [{"fs_date":"2022...],
                 "2022-08": [{"fs_date":"2022...],
                 "2022-09": [{"fs_date":"2022...],
                 "2022-10": [{"fs_date":"2022...],
                 "2022-11": [{"fs_date":"2022...]
              }
}
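
For reference, here is a minimal sketch of one way this structure could be built. It assumes the result is small enough to collect to the driver and that the 'JSON' column holds a JSON-encoded string. The helper name `rows_to_nested_dict` and the sample `fs_date` values are hypothetical, and plain dicts stand in for PySpark `Row` objects so the snippet runs without a cluster.

```python
import json
from collections import defaultdict


def rows_to_nested_dict(rows):
    """Group rows into {"ORIGIN-DEST": {"year-month": [records...]}}.

    `rows` is any iterable of objects supporting row["column"] access,
    such as the list returned by df.collect().
    """
    nested = defaultdict(dict)
    for row in rows:
        route = f'{row["fs_origin"]}-{row["fs_destination"]}'
        payload = row["JSON"]
        # Assumption: the column is a JSON-encoded string; parse it if so,
        # otherwise keep the value as it is.
        if isinstance(payload, str):
            payload = json.loads(payload)
        nested[route][row["year-month"]] = payload
    return dict(nested)


# Hypothetical sample data standing in for df.collect().
sample_rows = [
    {"fs_origin": "TLV", "fs_destination": "AUH",
     "year-month": "2022-06", "JSON": '[{"fs_date": "2022-06-01"}]'},
    {"fs_origin": "TLV", "fs_destination": "AUH",
     "year-month": "2022-07", "JSON": '[{"fs_date": "2022-07-01"}]'},
    {"fs_origin": "TLV", "fs_destination": "BAK",
     "year-month": "2022-06", "JSON": '[{"fs_date": "2022-06-02"}]'},
]

result = rows_to_nested_dict(sample_rows)
print(json.dumps(result, indent=2))
```

With a real DataFrame the call would presumably be `rows_to_nested_dict(df.collect())`; if the data does not fit on the driver, this approach would not be suitable.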

Thanks!

  • Does this answer your question? https://stackoverflow.com/a/38043364/19568102 – Jan Hrubec Jul 31 '22 at 08:35
  • @DanielAvigdor If you're going to extract all the data to dict anyway, you can use `df.toPandas()` first and then the pandas answer (which is now unfortunately deleted). – bzu Jul 31 '22 at 09:40
