This is a sample of a real-world problem that I cannot find a way to solve.
I need to create a nested JSON object from a pandas dataframe. Given this data, the JSON should look like this:
[
  {
    "city": "Belo Horizonte",
    "by_rooms": [
      {
        "rooms": 1,
        "total price": [
          {
            "total (R$)": 499,
            "details": [
              {
                "animal": "acept",
                "area": 22,
                "bathroom": 1,
                "parking spaces": 0,
                "furniture": "not furnished",
                "hoa (R$)": 30,
                "rent amount (R$)": 450,
                "property tax (R$)": 13,
                "fire insurance (R$)": 6
              }
            ]
          }
        ]
      },
      {
        "rooms": 2,
        "total price": [
          {
            "total (R$)": 678,
            "details": [
              {
                "animal": "not acept",
                "area": 50,
                "bathroom": 1,
                "parking spaces": 0,
                "furniture": "not furnished",
                "hoa (R$)": 0,
                "rent amount (R$)": 644,
                "property tax (R$)": 25,
                "fire insurance (R$)": 9
              }
            ]
          }
        ]
      }
    ]
  },
  {
    "city": "Campinas",
    "by_rooms": [
      {
        "rooms": 1,
        "total price": [
          {
            "total (R$)": 711,
            "details": [
              {
                "animal": "acept",
                "area": 42,
                "bathroom": 1,
                "parking spaces": 0,
                "furniture": "not furnished",
                "hoa (R$)": 0,
                "rent amount (R$)": 690,
                "property tax (R$)": 12,
                "fire insurance (R$)": 9
              }
            ]
          }
        ]
      }
    ]
  }
]
Each level can have one or more items.
Based on this answer, I have a snippet like this:
import pandas as pd

data = pd.read_csv("./houses_to_rent_v2.csv")
cols = data.columns
data = (
    data.groupby(['city', 'rooms', 'total (R$)'])[['animal', 'area', 'bathroom', 'parking spaces', 'furniture',
                                                   'hoa (R$)', 'rent amount (R$)', 'property tax (R$)', 'fire insurance (R$)']]
    .apply(lambda x: x.to_dict(orient='records'))
    .reset_index(name='details')
    .groupby(['city', 'rooms'])[['total (R$)', 'details']]
    .apply(lambda x: x.to_dict(orient='records'))
    .reset_index(name='total price')
    .groupby(['city'])[['rooms', 'total price']]
    .apply(lambda x: x.to_dict(orient='records'))
    .reset_index(name='by_rooms')
)
data.to_json('./jsondata.json', orient='records', force_ascii=False)
But all those groupbys don't look very Pythonic, and they're pretty slow.
Before using this method, I tried splitting the big dataframe into smaller ones and running an individual groupby for each level, but that was even slower.
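For reference, this is roughly what I mean by the splitting variant: group by city first, then nest each smaller frame level by level. The dataframe here is a tiny inline sample standing in for my CSV, and `detail_cols` stands in for the nine detail columns.

```python
import pandas as pd

# Tiny inline sample standing in for houses_to_rent_v2.csv (assumed columns)
data = pd.DataFrame({
    'city': ['Belo Horizonte', 'Belo Horizonte', 'Campinas'],
    'rooms': [1, 2, 1],
    'total (R$)': [499, 678, 711],
    'animal': ['acept', 'not acept', 'acept'],
    'furniture': ['not furnished'] * 3,
})
detail_cols = ['animal', 'furniture']  # the real frame has nine detail columns

# "Split first": group by city, then re-group each smaller frame per level
out = []
for city, city_df in data.groupby('city'):
    by_rooms = []
    for rooms, rooms_df in city_df.groupby('rooms'):
        totals = []
        for total, total_df in rooms_df.groupby('total (R$)'):
            totals.append({'total (R$)': total,
                           'details': total_df[detail_cols].to_dict(orient='records')})
        by_rooms.append({'rooms': rooms, 'total price': totals})
    out.append({'city': city, 'by_rooms': by_rooms})
```

It produces the same nesting, but the repeated grouping of ever-smaller frames makes it even slower on my data.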
I also tried dask, with no improvement at all.
I read about numba and cython, but I have no idea how to apply them in this case. All the docs I can find use only numeric data, and I have string and date/datetime data too.
In my real-world problem, this data is processed in response to an HTTP request. My dataframe has 30+ columns and ~35K rows per request, and it takes 45 seconds to process just this snippet.
So, is there a faster way to do this?
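One direction I've been sketching (not sure it's the right approach) is to drop the chained groupby/apply entirely and build the nesting in a single pass over plain records, keeping a dict index per level so each row is appended in O(1). Again, the dataframe below is a small inline sample with assumed columns, not my real CSV:

```python
import json

import pandas as pd

# Tiny inline sample standing in for houses_to_rent_v2.csv (assumed columns)
data = pd.DataFrame({
    'city': ['Belo Horizonte', 'Belo Horizonte', 'Campinas'],
    'rooms': [1, 2, 1],
    'total (R$)': [499, 678, 711],
    'animal': ['acept', 'not acept', 'acept'],
    'furniture': ['not furnished'] * 3,
})
detail_cols = ['animal', 'furniture']  # the real frame has nine detail columns

# One pass over plain records: index every nesting level by its key so each
# row is appended in O(1) instead of re-grouping the frame three times.
result = []
cities, rooms_lvl, totals_lvl = {}, {}, {}
for rec in data.to_dict(orient='records'):
    city_key = rec['city']
    if city_key not in cities:
        cities[city_key] = {'city': city_key, 'by_rooms': []}
        result.append(cities[city_key])
    rk = (city_key, rec['rooms'])
    if rk not in rooms_lvl:
        rooms_lvl[rk] = {'rooms': rec['rooms'], 'total price': []}
        cities[city_key]['by_rooms'].append(rooms_lvl[rk])
    tk = rk + (rec['total (R$)'],)
    if tk not in totals_lvl:
        totals_lvl[tk] = {'total (R$)': rec['total (R$)'], 'details': []}
        rooms_lvl[rk]['total price'].append(totals_lvl[tk])
    totals_lvl[tk]['details'].append({c: rec[c] for c in detail_cols})

json_str = json.dumps(result, ensure_ascii=False)
```

This preserves the row order within each level rather than sorting like groupby does, which is fine for my use case, but I don't know if it's the idiomatic way to do this with pandas.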