
I've got a CSV with a bunch of data. One of the columns, ExtraParams, contains a JSON object. I want to extract a value for a specific key, but it's taking quite a while to get through the 60,000-something rows in the CSV. Can it be sped up?

import json  # needed for json.loads below

counter = 0  # just to see where I'm at

order_data['NewColumn'] = ''

for row in range(len(total_data)):
    s = total_data['ExtraParams'][row]
    try:
        data = json.loads(s)
        new_data = data['NewColumn']

        counter += 1
        print(counter)

        order_data['NewColumn'][row] = new_data
    except (ValueError, KeyError):  # malformed JSON, or key missing
        print('NewColumn not in row')

I use a try/except because a few of the rows have what I assume is messed-up JSON; they crash the program with an "Expecting ',' delimiter" error.
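For reference, here's a quick way to see which rows actually hold broken JSON before deciding how to handle them. This is a minimal sketch against the total_data DataFrame above; the is_valid_json helper is just for illustration:

import json

def is_valid_json(s):
    # Return True if s parses as JSON, False otherwise.
    try:
        json.loads(s)
        return True
    except ValueError:  # json.JSONDecodeError is a subclass of ValueError
        return False

bad_rows = total_data[~total_data['ExtraParams'].apply(is_valid_json)]
print(bad_rows.index.tolist())  # labels of the rows with malformed JSON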

When I say "slow" I mean ~30 minutes for 60,000 rows.

EDIT: It might be worth noting that each JSON object contains about 35 key/value pairs.

Khaine775
  • Can you share an example of your JSON? – Mohamed Ali JAMAOUI Aug 29 '17 at 13:08
  • 1
    Neither JSON nor CSV are made for large files. You should consider importing the data into a database and process it from there. It might be a good idea to use a DB server with JSON support. – Klaus D. Aug 29 '17 at 13:09
  • @KlausD. That's probably the best idea, yeah. I haven't been able to find anyone efficiently doing what I intend using solely Python/Pandas for a large dataset. [This](https://stackoverflow.com/questions/14262433/large-data-work-flows-using-pandas) post might offer a solution though. – Khaine775 Aug 29 '17 at 13:26
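For completeness, here is a rough sketch of the database route Klaus D. suggests, assuming your Python's SQLite build includes the JSON1 functions (true of most modern builds); the file and table names are placeholders:

import sqlite3
import pandas as pd

con = sqlite3.connect('orders.db')

# Load the CSV once, then let SQLite parse the JSON column on demand.
pd.read_csv('data.csv').to_sql('orders', con, if_exists='replace', index=False)

# json_extract pulls a value by key path; '$.NewColumn' matches the key
# used in the question's loop.
result = pd.read_sql_query(
    "SELECT json_extract(ExtraParams, '$.NewColumn') AS NewColumn FROM orders",
    con,
)
con.close()

Note that SQLite's json_extract still errors on malformed JSON, so the broken rows would need cleaning up first.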

1 Answer


You could use pandas and its apply method. Given some simple sample data in test.csv:

Col1,Col2,ExtraParams
1,"a",{"dog":10}
2,"b",{"dog":5}
3,"c",{"dog":6}

you could parse the whole column in one go:

In [1]: import pandas as pd

In [2]: import json

In [3]: df = pd.read_csv("test.csv")

In [4]: df.ExtraParams.apply(json.loads)
Out[4]: 
0    {'dog': 10}
1    {'dog': 5}
2    {'dog': 6}
Name: ExtraParams, dtype: object

If you need to extract a field from the JSON, assuming the field is present in each row, you can use a lambda:

In [5]: df.ExtraParams.apply(lambda x: json.loads(x)['dog'])
Out[5]: 
0    10
1     5
2     6
Name: ExtraParams, dtype: int64
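Since the question mentions a few rows with malformed JSON, the bare lambda above would raise on those. A minimal defensive variant might look like this (the extract_key helper is my own sketch, not pandas API):

import json
import pandas as pd

def extract_key(raw, key):
    # Parse one JSON string and pull out `key`; return None for bad rows.
    try:
        return json.loads(raw)[key]
    except (ValueError, KeyError):  # malformed JSON or missing key
        return None

df = pd.read_csv('test.csv')
df['dog'] = df['ExtraParams'].apply(extract_key, key='dog')

Much of the original slowdown likely comes from printing on every row and from the chained assignment order_data['NewColumn'][row] = new_data inside the loop, both of which this single apply pass avoids.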
mgilbert