
I have a list of the form

results = [{'Parcel': {'AIN': '2004001003','Longitude': '-118.620668807','Latitude':'34.2202197879'}}, {'Parcel': {'AIN': '2004001004','Longitude': '-118.620668303','Latitude': '34.2200390973'}}]

I want to transform this into a dataframe with three columns, 'AIN', 'Longitude', and 'Latitude'. I have tried this:

appended_data = pd.DataFrame()
for i in range(len(results)):
    appended_data = pd.concat([appended_data, pd.DataFrame(results[i].values())])
appended_data

This seems to work, but in reality I have a large list of more than 500,000 observations, so my approach takes forever. How can I speed this up?

Thank you!

cs95

2 Answers


If the structure is consistent, it is enough to unpack each "Parcel" inside a list comprehension:

pd.DataFrame([result['Parcel'] for result in results])

          AIN       Longitude       Latitude
0  2004001003  -118.620668807  34.2202197879
1  2004001004  -118.620668303  34.2200390973

The output of the list comprehension is a list of records of the format

[{'AIN': '2004001003',
  'Longitude': '-118.620668807',
  'Latitude': '34.2202197879'},
 {'AIN': '2004001004',
  'Longitude': '-118.620668303',
  'Latitude': '34.2200390973'}]

which pd.DataFrame can work with.

By the way ... never grow a DataFrame. Calling pd.concat in a loop copies every previously accumulated row on each iteration, so the total work grows quadratically with the number of rows.
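To make the contrast concrete, here is a small sketch (using the two sample records from the question) of the quadratic loop versus building the frame once from a plain list; both produce the same rows, but only the second scales to 500,000 records:

```python
import pandas as pd

results = [
    {'Parcel': {'AIN': '2004001003', 'Longitude': '-118.620668807', 'Latitude': '34.2202197879'}},
    {'Parcel': {'AIN': '2004001004', 'Longitude': '-118.620668303', 'Latitude': '34.2200390973'}},
]

# Anti-pattern: each pd.concat copies all rows accumulated so far,
# so n iterations do O(n^2) total copying.
df_slow = pd.DataFrame()
for r in results:
    df_slow = pd.concat([df_slow, pd.DataFrame([r['Parcel']])])

# Preferred: accumulate plain dicts, then build the DataFrame in one call.
df_fast = pd.DataFrame([r['Parcel'] for r in results])
```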

cs95

You can use a list comprehension as below.

var = [x.get('Parcel') for x in results]

df = pd.DataFrame(var)
  • This is essentially using `x['Parcel']` except via a function call instead of native dict accessor syntax :( – cs95 Apr 11 '23 at 17:35
  • You are correct. However, I added an alternative method to load the dict into a data frame – Sajil Alakkalakath Apr 11 '23 at 17:54
  • 1
    I'm sceptical about the claim that json_normalize is a faster alternative, given it's meant to handle arbitrarily complex nested json structures; passing it a list of flat records is using a jackhammer to kill a mosquito. Have you timed the two methods and compared execution? – cs95 Apr 11 '23 at 18:18
  • 1
    You are correct indeed. json_normalize is in no way faster. I tested on 2 million rows with 2 columns; json_normalize is way too slow. – Sajil Alakkalakath Apr 11 '23 at 18:54
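For reference, the json_normalize alternative discussed in the comments above would look like this on the question's data; note that it flattens the nested dicts into dot-prefixed column names, which then need to be renamed:

```python
import pandas as pd

results = [
    {'Parcel': {'AIN': '2004001003', 'Longitude': '-118.620668807', 'Latitude': '34.2202197879'}},
    {'Parcel': {'AIN': '2004001004', 'Longitude': '-118.620668303', 'Latitude': '34.2200390973'}},
]

# json_normalize flattens nested dicts, prefixing columns with the
# outer key: 'Parcel.AIN', 'Parcel.Longitude', 'Parcel.Latitude'.
df = pd.json_normalize(results)

# Strip the prefix to get the three plain column names.
df.columns = [c.replace('Parcel.', '') for c in df.columns]
```

As the comment thread concludes, this is measurably slower than the plain list comprehension for large inputs, since json_normalize is built to handle arbitrarily nested structures.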