
This is a sample of a real-world problem that I cannot find a way to solve.

I need to create a nested JSON object from a pandas DataFrame. Given this data, I need to produce a JSON object like this:

[
  {
    "city": "Belo Horizonte",
    "by_rooms": [
      {
        "rooms": 1,
        "total price": [
          {
            "total (R$)": 499,
            "details": [
              {
                "animal": "acept",
                "area": 22,
                "bathroom": 1,
                "parking spaces": 0,
                "furniture": "not furnished",
                "hoa (R$)": 30,
                "rent amount (R$)": 450,
                "property tax (R$)": 13,
                "fire insurance (R$)": 6
              }
            ]
          }
        ]
      },
      {
        "rooms": 2,
        "total price": [
          {
            "total (R$)": 678,
            "details": [
              {
                "animal": "not acept",
                "area": 50,
                "bathroom": 1,
                "parking spaces": 0,
                "furniture": "not furnished",
                "hoa (R$)": 0,
                "rent amount (R$)": 644,
                "property tax (R$)": 25,
                "fire insurance (R$)": 9
              }
            ]
          }
        ]
      }
    ]
  },
  {
    "city": "Campinas",
    "by_rooms": [
      {
        "rooms": 1,
        "total price": [
          {
            "total (R$)": 711,
            "details": [
              {
                "animal": "acept",
                "area": 42,
                "bathroom": 1,
                "parking spaces": 0,
                "furniture": "not furnished",
                "hoa (R$)": 0,
                "rent amount (R$)": 690,
                "property tax (R$)": 12,
                "fire insurance (R$)": 9
              }
            ]
          }
        ]
      }
    ]
  }
]

Each level can have one or more items.

Based on this answer, I have a snippet like this:

import pandas as pd

data = pd.read_csv("./houses_to_rent_v2.csv")
data = (
    data.groupby(['city', 'rooms', 'total (R$)'])[['animal', 'area', 'bathroom', 'parking spaces', 'furniture',
                                                   'hoa (R$)', 'rent amount (R$)', 'property tax (R$)', 'fire insurance (R$)']]
    .apply(lambda x: x.to_dict(orient='records'))
    .reset_index(name='details')
    .groupby(['city', 'rooms'])[['total (R$)', 'details']]
    .apply(lambda x: x.to_dict(orient='records'))
    .reset_index(name='total price')
    .groupby(['city'])[['rooms', 'total price']]
    .apply(lambda x: x.to_dict(orient='records'))
    .reset_index(name='by_rooms')
)

data.to_json('./jsondata.json', orient='records', force_ascii=False)
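For reference, the `.apply(lambda x: x.to_dict(orient='records'))` steps above hinge on the `records` orientation, which turns each row of a (sub-)DataFrame into one dict keyed by column name. A minimal illustration with made-up data:

```python
import pandas as pd

# two-row frame standing in for one groupby group (illustrative data only)
df = pd.DataFrame({"animal": ["acept", "not acept"], "area": [22, 50]})

# each row becomes one dict keyed by column name
records = df.to_dict(orient="records")
# → [{'animal': 'acept', 'area': 22}, {'animal': 'not acept', 'area': 50}]
```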

but all those groupbys don't look very Pythonic, and it's pretty slow.

Before using this method, I tried splitting the big DataFrame into smaller ones and running individual groupbys for each level, but that was even slower.

I tried dask, with no improvement at all.

I read about numba and Cython, but I have no idea how to apply them in this case. All the docs I found use only numeric data, while I have string and date/datetime data too.

In my real-world problem, this data is processed in response to an HTTP request. My DataFrame has 30+ columns and ~35K rows per request, and it takes 45 seconds to process just this snippet.

So, is there a faster way to do this?

2 Answers


This can be done with list/dict comprehensions. I haven't timed it, but I'm not left waiting for it.

import kaggle.cli
import sys, requests
import pandas as pd
from pathlib import Path
from zipfile import ZipFile
import urllib

# fmt: off
# download data set
url = "https://www.kaggle.com/rubenssjr/brasilian-houses-to-rent"
sys.argv = [sys.argv[0]] + f"datasets download {urllib.parse.urlparse(url).path[1:]}".split(" ")
kaggle.cli.main()
zfile = ZipFile(f'{urllib.parse.urlparse(url).path.split("/")[-1]}.zip')
dfs = {f.filename: pd.read_csv(zfile.open(f)) for f in zfile.infolist()}
# fmt: on

js = [
    {
        "city": g[0],
        "by_room": [
            {
                "rooms": r["rooms"],
                "total_price": [
                    {
                        "total (R$)": r["total (R$)"],
                        "details": [
                            {
                                k: v
                                for k, v in r.items()
                                if k not in ["city", "rooms", "total (R$)"]
                            }
                        ],
                    }
                ],
            }
            for r in g[1].to_dict("records")
        ],
    }
    for g in dfs["houses_to_rent_v2.csv"].groupby("city")
]

print(len(js), len(js[0]["by_room"]))
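To write the resulting list to disk the way the question's snippet does, the standard-library `json` module is enough. A self-contained sketch (the filename and the tiny stand-in for `js` are illustrative):

```python
import json

# tiny stand-in for the `js` list built above (illustrative data only)
js = [{"city": "Belo Horizonte", "by_room": [{"rooms": 1, "total_price": []}]}]

# ensure_ascii=False keeps accented city names (e.g. "São Paulo") readable,
# matching force_ascii=False in the question's to_json call
with open("jsondata.json", "w", encoding="utf-8") as fh:
    json.dump(js, fh, ensure_ascii=False, indent=2)
```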
Rob Raymond

I needed to adapt @RobRaymond's answer, because I need the inner data grouped too. So I took his code, made some adjustments, and this is the final result:

import kaggle.cli
import sys, requests
import pandas as pd
from pathlib import Path
from zipfile import ZipFile
import urllib

# fmt: off
# download data set
url = "https://www.kaggle.com/rubenssjr/brasilian-houses-to-rent"
sys.argv = [sys.argv[0]] + f"datasets download {urllib.parse.urlparse(url).path[1:]}".split(" ")
kaggle.cli.main()
zfile = ZipFile(f'{urllib.parse.urlparse(url).path.split("/")[-1]}.zip')
dfs = {f.filename: pd.read_csv(zfile.open(f)) for f in zfile.infolist()}
# fmt: on

js = [
    {
        "city": g[0],
        "by_room": [
            {
                "rooms": r["rooms"],
                "total_price": [
                    {
                        "total (R$)": r["total (R$)"],
                        "details": [
                            {
                                k: v
                                for k, v in r.items()
                                if k not in ["city", "rooms", "total (R$)"]
                            }
                        ],
                    }
                ],
            }
            for r in g[1].to_dict("records")
        ],
    }
    for g in dfs["houses_to_rent_v2.csv"].groupby("city")
]

for city in js:
    rooms_qty = list(set(r['rooms'] for r in city['by_room']))
    newRooms = [{'rooms': x, 'total_price': []} for x in rooms_qty]
    for r in city['by_room']:
        newRooms[rooms_qty.index(r['rooms'])]['total_price'].extend(r['total_price'])
    for r in newRooms:
        prices = list(set(p['total (R$)'] for p in r['total_price']))
        newPrices = [{'total (R$)': x, 'details': []} for x in prices]
        for price in r['total_price']:
            newPrices[prices.index(price['total (R$)'])]['details'].extend(price['details'])
        r['total_price'] = newPrices
    city['by_room'] = newRooms

And the execution time drops to 5 seconds.
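As a side note, the repeated `list.index` calls in the regrouping loops are linear scans, so each pass is quadratic in the number of distinct rooms/prices. A dict keyed on the grouping value does the same two-level regrouping in one pass. This is a sketch of that alternative, not the answer's code; the name `regroup` is mine:

```python
def regroup(by_room):
    # rooms value -> {total price -> accumulated detail dicts}
    rooms = {}
    for r in by_room:
        prices = rooms.setdefault(r["rooms"], {})
        for p in r["total_price"]:
            prices.setdefault(p["total (R$)"], []).extend(p["details"])
    # rebuild the nested list-of-dicts shape expected in the JSON output
    return [
        {"rooms": room,
         "total_price": [{"total (R$)": price, "details": details}
                         for price, details in prices.items()]}
        for room, prices in rooms.items()
    ]
```

Usage would be `city["by_room"] = regroup(city["by_room"])` for each city in `js`, replacing the index-based loops above.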