
A colleague has asked me to convert 6 huge files from the "Yelp Dataset Challenge" from fairly "flat", regular JSON into CSV (he thinks they'd make fun teaching data).

I thought I could bang it out with:

# With thanks to http://www.diveintopython3.net/files.html and https://www.reddit.com/r/MachineLearning/comments/33eglq/python_help_jsoncsv_pandas/cqkwyu8/

import os
import pandas

jsondir = 'c:\\example\\bigfiles\\'
csvdir = 'c:\\example\\bigcsvfiles\\'
if not os.path.exists(csvdir): os.makedirs(csvdir)

for file in os.listdir(jsondir):
    with open(jsondir+file, 'r', encoding='utf-8') as f:
        data = f.readlines()   # reads the entire file into memory at once
    df = pandas.read_json('[' + ','.join(map(lambda x: x.rstrip(), data)) + ']')
    df.to_csv(csvdir+os.path.splitext(file)[0]+'.csv', index=0, quoting=1)

Unfortunately, my computer's memory isn't up to files this size. Even without the loop, the script converts a 50 MB file in under a minute, but on the 100 MB+ files it either freezes my computer or crashes, and the biggest file is 3.25 GB.

Is there something else simple but performant that I can run instead?

A loop would be great, but I can also run it 6 times with separate filenames if that makes a difference for memory (there are only 6 files).

Here's an example of the contents of a ".json" file -- note that each file actually has lots of JSON objects, 1 per line.

{"business_id":"xyzzy","name":"Business A","neighborhood":"","address":"XX YY ZZ","city":"Tempe","state":"AZ","postal_code":"85283","latitude":33.32823894longitude":-111.28948,"stars":3,"review_count":3,"is_open":0,"attributes":["BikeParking: True","BusinessAcceptsBitcoin: False","BusinessAcceptsCreditCards: True","BusinessParking: {'garage': False, 'street': False, 'validated': False, 'lot': True, 'valet': False}","DogsAllowed: False","RestaurantsPriceRange2: 2","WheelchairAccessible: True"],"categories":["Tobacco Shops","Nightlife","Vape Shops","Shopping"],"hours":["Monday 11:0-21:0","Tuesday 11:0-21:0","Wednesday 11:0-21:0","Thursday 11:0-21:0","Friday 11:0-22:0","Saturday 10:0-22:0","Sunday 11:0-18:0"],"type":"business"}
{"business_id":"dsfiuweio2f","name":"Some Place","neighborhood":"","address":"Strip or something","city":"Las Vegas","state":"NV","postal_code":"89106","latitude":36.189134,"longitude":-115.92094,"stars":1.5,"review_count":2,"is_open":1,"attributes":["BusinessAcceptsBitcoin: False","BusinessAcceptsCreditCards: True"],"categories":["Caterers","Grocery","Food","Event Planning & Services","Party & Event Planning","Specialty Food"],"hours":["Monday 0:0-0:0","Tuesday 0:0-0:0","Wednesday 0:0-0:0","Thursday 0:0-0:0","Friday 0:0-0:0","Saturday 0:0-0:0","Sunday 0:0-0:0"],"type":"business"}
{"business_id":"abccb","name":"La la la","neighborhood":"Blah blah","address":"Yay that","city":"Toronto","state":"ON","postal_code":"M6H 1L5","latitude":43.283984,"longitude":-79.28284,"stars":2,"review_count":6,"is_open":1,"attributes":["Alcohol: none","Ambience: {'romantic': False, 'intimate': False, 'classy': False, 'hipster': False, 'touristy': False, 'trendy': False, 'upscale': False, 'casual': False}","BikeParking: True","BusinessAcceptsCreditCards: True","BusinessParking: {'garage': False, 'street': False, 'validated': False, 'lot': False, 'valet': False}","Caters: True","GoodForKids: True","GoodForMeal: {'dessert': False, 'latenight': False, 'lunch': False, 'dinner': False, 'breakfast': False, 'brunch': False}","HasTV: True","NoiseLevel: quiet","OutdoorSeating: False","RestaurantsAttire: casual","RestaurantsDelivery: True","RestaurantsGoodForGroups: True","RestaurantsPriceRange2: 1","RestaurantsReservations: False","RestaurantsTableService: False","RestaurantsTakeOut: True","WiFi: free"],"categories":["Restaurants","Pizza","Chicken Wings","Italian"],"hours":["Monday 11:0-2:0","Tuesday 11:0-2:0","Wednesday 11:0-2:0","Thursday 11:0-3:0","Friday 11:0-3:0","Saturday 11:0-3:0","Sunday 11:0-2:0"],"type":"business"}

The nested JSON data can simply remain as the string literals representing it -- I'm only looking to convert top-level keys into CSV file headings.

  • Instead of **reading and parsing the whole file** at once, you can try to **read one JSON dictionary (one CSV row) at a time**, then parse it and append it to the CSV. It requires a little more manual coding, but performs well in a file-stream style. – Neo X Feb 10 '17 at 22:25
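
Spelling out that line-at-a-time idea with just the standard library (a sketch, untested on the full dataset; it assumes the top-level keys of the first line cover every column, and nested values are simply written out as their string representation, which is what the question asks for):

import csv
import json
import os

jsondir = 'c:\\example\\bigfiles\\'
csvdir = 'c:\\example\\bigcsvfiles\\'

for file in os.listdir(jsondir):
    out_path = os.path.join(csvdir, os.path.splitext(file)[0] + '.csv')
    with open(os.path.join(jsondir, file), 'r', encoding='utf-8') as f, \
         open(out_path, 'w', encoding='utf-8', newline='') as out:
        writer = None
        for line in f:                        # one JSON object per line
            if not line.strip():
                continue
            record = json.loads(line)
            if writer is None:                # first record's keys become the CSV header
                writer = csv.DictWriter(out, fieldnames=list(record),
                                        quoting=csv.QUOTE_ALL, extrasaction='ignore')
                writer.writeheader()
            writer.writerow(record)           # nested dicts/lists land as str(...) literals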

1 Answer


The problem is that your code reads the entire file into memory, then creates a near-copy of it in memory. I suspect it also creates a third copy, but haven't verified that. The solution, as suggested by Neo X, is to read in the file line-by-line and process it accordingly. Here's a replacement for the for loop:

for file in os.listdir(jsondir):
    csv_file = csvdir + os.path.splitext(file)[0] + '.csv'
    with open(jsondir+file, 'r', encoding='utf-8') as f, open(csv_file, 'w', encoding='utf-8') as csv:
        header = True
        for line in f:
            # parse a single line by wrapping it in [] so read_json sees a one-row JSON array
            df = pandas.read_json(''.join(('[', line.rstrip(), ']')))
            # append the row to the already-open output file, writing the header only once
            df.to_csv(csv, header=header, index=0, quoting=1)
            header = False

I've tested this with Python 3.5 on a Mac; it should work on Windows, but I haven't tested it there.

Notes:

  1. I've adjusted your json data; there appeared to be an error around latitude / longitude in the first line.

  2. This has only been tested with a small file; I'm not sure where to get your 3.25 GB file from.

  3. I'm assuming this is a one-time usage for your friend. If this were Production code, you would need to verify that exception handling for the 'with' statement is correct. See How can I open multiple files using "with open" in Python? for details.

  4. This should be fairly performant, but again, I'm not sure where to get your large files from.

  • Check out [ijson](https://pypi.python.org/pypi/ijson/), it makes streaming a JSON file as easy as using a Python iterator – sundance Feb 11 '17 at 04:15
  • @kevin: Question: Why doesn't your `to_csv()` include a `mode='a'` parameter? Is there something about calling `to_csv()` inside a `with open` that makes it auto-append? Also, your code works beautifully -- converting a small file takes longer, but my computer no longer freezes, and the job still completes in a reasonable amount of time (it should easily be done by the end of the day), so I can just let it run in the background. Thanks a million. (Finally, I edited your code to include UTF-8 encoding on the output file--I was getting errors w/ foreign input data until I did that.) – k.. Feb 13 '17 at 18:09
  • @k.. Glad I could help! Since `csv` is already open and is passed into `to_csv()`, the latter doesn't close the file after writing. You can verify by looking at `def save()` in the source; it sets `close = False` at line 1476. https://github.com/pandas-dev/pandas/blob/master/pandas/formats/format.py. Good question! – kevin Feb 13 '17 at 19:20
  • there is the option `header=first` that eliminates two lines of your code – Lucas Azevedo Apr 27 '20 at 13:24
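
What the `header=first` comment is presumably pointing at: the header flag can be computed inline instead of being tracked in a separate variable, for example with enumerate (a sketch using the same names as the answer's loop):

        for i, line in enumerate(f):
            df = pandas.read_json(''.join(('[', line.rstrip(), ']')))
            df.to_csv(csv, header=(i == 0), index=0, quoting=1)   # header only on the first row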
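
And for the record, sundance's ijson suggestion is aimed at streaming one large JSON document (for example a single top-level array) rather than one object per line. In that case it could look roughly like this (a sketch, not tested against the Yelp files; the filenames are placeholders):

import csv
import ijson  # pip install ijson

with open('business_array.json', 'rb') as f, \
     open('business_array.csv', 'w', encoding='utf-8', newline='') as out:
    writer = None
    for obj in ijson.items(f, 'item'):        # 'item' = each element of the top-level array
        if writer is None:                    # build the header from the first object's keys
            writer = csv.DictWriter(out, fieldnames=list(obj), quoting=csv.QUOTE_ALL)
            writer.writeheader()
        writer.writerow(obj)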