
A colleague has asked me to convert 6 huge files from the "Yelp Dataset Challenge" from fairly "flat", regular JSON into CSV (he thinks they'd make fun teaching data).

I thought I could bang it out with:

# With thanks to http://www.diveintopython3.net/files.html and https://www.reddit.com/r/MachineLearning/comments/33eglq/python_help_jsoncsv_pandas/cqkwyu8/

import os
import pandas

jsondir = 'c:\\example\\bigfiles\\'
csvdir = 'c:\\example\\bigcsvfiles\\'
if not os.path.exists(csvdir): os.makedirs(csvdir)

for file in os.listdir(jsondir):
    with open(jsondir+file, 'r', encoding='utf-8') as f:
        data = f.readlines()   # reads the entire file into memory at once
    df = pandas.read_json('[' + ','.join(map(lambda x: x.rstrip(), data)) + ']')
    df.to_csv(csvdir+os.path.splitext(file)[0]+'.csv', index=0, quoting=1)

Unfortunately, my computer's memory isn't up to files this size. Even without the loop, the script converts a 50 MB file in under a minute, but on the 100 MB+ files it either freezes my computer or crashes, and the biggest file is 3.25 GB.

Is there something else simple but performant that I can run instead?

A loop would be great, but I can also run it 6 times with separate filenames if that makes a difference for memory (there are only 6 files).

Here's an example of the contents of a ".json" file -- note that each file actually has lots of JSON objects, 1 per line.

{"business_id":"xyzzy","name":"Business A","neighborhood":"","address":"XX YY ZZ","city":"Tempe","state":"AZ","postal_code":"85283","latitude":33.32823894longitude":-111.28948,"stars":3,"review_count":3,"is_open":0,"attributes":["BikeParking: True","BusinessAcceptsBitcoin: False","BusinessAcceptsCreditCards: True","BusinessParking: {'garage': False, 'street': False, 'validated': False, 'lot': True, 'valet': False}","DogsAllowed: False","RestaurantsPriceRange2: 2","WheelchairAccessible: True"],"categories":["Tobacco Shops","Nightlife","Vape Shops","Shopping"],"hours":["Monday 11:0-21:0","Tuesday 11:0-21:0","Wednesday 11:0-21:0","Thursday 11:0-21:0","Friday 11:0-22:0","Saturday 10:0-22:0","Sunday 11:0-18:0"],"type":"business"}
{"business_id":"dsfiuweio2f","name":"Some Place","neighborhood":"","address":"Strip or something","city":"Las Vegas","state":"NV","postal_code":"89106","latitude":36.189134,"longitude":-115.92094,"stars":1.5,"review_count":2,"is_open":1,"attributes":["BusinessAcceptsBitcoin: False","BusinessAcceptsCreditCards: True"],"categories":["Caterers","Grocery","Food","Event Planning & Services","Party & Event Planning","Specialty Food"],"hours":["Monday 0:0-0:0","Tuesday 0:0-0:0","Wednesday 0:0-0:0","Thursday 0:0-0:0","Friday 0:0-0:0","Saturday 0:0-0:0","Sunday 0:0-0:0"],"type":"business"}
{"business_id":"abccb","name":"La la la","neighborhood":"Blah blah","address":"Yay that","city":"Toronto","state":"ON","postal_code":"M6H 1L5","latitude":43.283984,"longitude":-79.28284,"stars":2,"review_count":6,"is_open":1,"attributes":["Alcohol: none","Ambience: {'romantic': False, 'intimate': False, 'classy': False, 'hipster': False, 'touristy': False, 'trendy': False, 'upscale': False, 'casual': False}","BikeParking: True","BusinessAcceptsCreditCards: True","BusinessParking: {'garage': False, 'street': False, 'validated': False, 'lot': False, 'valet': False}","Caters: True","GoodForKids: True","GoodForMeal: {'dessert': False, 'latenight': False, 'lunch': False, 'dinner': False, 'breakfast': False, 'brunch': False}","HasTV: True","NoiseLevel: quiet","OutdoorSeating: False","RestaurantsAttire: casual","RestaurantsDelivery: True","RestaurantsGoodForGroups: True","RestaurantsPriceRange2: 1","RestaurantsReservations: False","RestaurantsTableService: False","RestaurantsTakeOut: True","WiFi: free"],"categories":["Restaurants","Pizza","Chicken Wings","Italian"],"hours":["Monday 11:0-2:0","Tuesday 11:0-2:0","Wednesday 11:0-2:0","Thursday 11:0-3:0","Friday 11:0-3:0","Saturday 11:0-3:0","Sunday 11:0-2:0"],"type":"business"}

The nested JSON data can simply remain as the string literals representing it -- I'm only looking to convert top-level keys into CSV file headings.

  • Instead of **reading and parsing the whole file** at once, you can try to **read one JSON dictionary (one CSV row) at a time**, then parse it and append it to the CSV. It requires a little more manual coding, but performs well in a file-stream style. – Neo X Feb 10 '17 at 22:25
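
Spelling out that line-at-a-time idea with just the standard library (a sketch, untested on the full dataset; it assumes the top-level keys of the first line cover every column, and nested values are simply written out as their string representation, which is what the question asks for):

import csv
import json
import os

jsondir = 'c:\\example\\bigfiles\\'
csvdir = 'c:\\example\\bigcsvfiles\\'

for file in os.listdir(jsondir):
    out_path = os.path.join(csvdir, os.path.splitext(file)[0] + '.csv')
    with open(os.path.join(jsondir, file), 'r', encoding='utf-8') as f, \
         open(out_path, 'w', encoding='utf-8', newline='') as out:
        writer = None
        for line in f:                        # one JSON object per line
            if not line.strip():
                continue
            record = json.loads(line)
            if writer is None:                # first record's keys become the CSV header
                writer = csv.DictWriter(out, fieldnames=list(record),
                                        quoting=csv.QUOTE_ALL, extrasaction='ignore')
                writer.writeheader()
            writer.writerow(record)           # nested dicts/lists land as str(...) literals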

1 Answer


The problem is that your code reads the entire file into memory, then creates a near-copy of it in memory. I suspect it also creates a third copy, but haven't verified that. The solution, as suggested by Neo X, is to read in the file line-by-line and process it accordingly. Here's a replacement for the for loop:

for file in os.listdir(jsondir):
    csv_file = csvdir + os.path.splitext(file)[0] + '.csv'
    with open(jsondir+file, 'r', encoding='utf-8') as f, open(csv_file, 'w', encoding='utf-8') as csv:
        header = True
        for line in f:
            # parse a single line by wrapping it in [] so read_json sees a one-row JSON array
            df = pandas.read_json(''.join(('[', line.rstrip(), ']')))
            # append the row to the already-open output file, writing the header only once
            df.to_csv(csv, header=header, index=0, quoting=1)
            header = False

I've tested this with Python 3.5 on a Mac; it should work on Windows, but I haven't tested it there.

Notes:

  1. I've adjusted your json data; there appeared to be an error around latitude / longitude in the first line.

  2. This has only been tested with a small file; I'm not sure where to get your 3.25 GB file from.

  3. I'm assuming this is a one-time usage for your friend. If this were Production code, you would need to verify that exception handling for the 'with' statement is correct. See How can I open multiple files using "with open" in Python? for details.

  4. This should be fairly performant, but again, I'm not sure where to get your large files from.

  • Check out [ijson](https://pypi.python.org/pypi/ijson/), it makes streaming a JSON file as easy as using a Python iterator – sundance Feb 11 '17 at 04:15
  • @kevin: Question: Why doesn't your `to_csv()` include a `mode='a'` parameter? Is there something about calling `to_csv()` inside a `with open` that makes it auto-append? Also, your code works beautifully -- converting a small file takes longer, but my computer no longer freezes, and the job still completes in a reasonable amount of time (it should easily be done by the end of the day), so I can just let it run in the background. Thanks a million. (Finally, I edited your code to include UTF-8 encoding on the output file--I was getting errors w/ foreign input data until I did that.) – k.. Feb 13 '17 at 18:09
  • @k.. Glad I could help! Since `csv` is already open and is passed into `to_csv()`, the latter doesn't close the file after writing. You can verify by looking at `def save()` in the source; it sets `close = False` at line 1476. https://github.com/pandas-dev/pandas/blob/master/pandas/formats/format.py. Good question! – kevin Feb 13 '17 at 19:20
  • there is the option `header=first` that eliminates two lines of your code – Lucas Azevedo Apr 27 '20 at 13:24
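
What the `header=first` comment is presumably pointing at: the header flag can be computed inline instead of being tracked in a separate variable, for example with enumerate (a sketch using the same names as the answer's loop):

        for i, line in enumerate(f):
            df = pandas.read_json(''.join(('[', line.rstrip(), ']')))
            df.to_csv(csv, header=(i == 0), index=0, quoting=1)   # header only on the first row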
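
And for the record, sundance's ijson suggestion is aimed at streaming one large JSON document (for example a single top-level array) rather than one object per line. In that case it could look roughly like this (a sketch, not tested against the Yelp files; the filenames are placeholders):

import csv
import ijson  # pip install ijson

with open('business_array.json', 'rb') as f, \
     open('business_array.csv', 'w', encoding='utf-8', newline='') as out:
    writer = None
    for obj in ijson.items(f, 'item'):        # 'item' = each element of the top-level array
        if writer is None:                    # build the header from the first object's keys
            writer = csv.DictWriter(out, fieldnames=list(obj), quoting=csv.QUOTE_ALL)
            writer.writeheader()
        writer.writerow(obj)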