1

I have 10K folders, each with 200 records spread across 200 JSON files. I'm trying to compile all the records into one DataFrame and then finally into a CSV (other format suggestions welcome).

Here is my working solution, which takes around 8.3 hours just for the DataFrame-building step (not counting the conversion to CSV).

%%time
import json
from pathlib import Path

import pandas as pd

finalDf = pd.DataFrame()
rootdir = '/path/foldername'
all_files = Path(rootdir).rglob('*.json')
for filename in all_files:
    with open(filename, 'r') as f:
        data = json.load(f)
        # 'A' and 'B' stand in for the real column names
        df = pd.json_normalize(data).drop(columns=['A']).rename(columns={'B': 'Date'})
        finalDf = finalDf.append(df, ignore_index=True)

Any suggestions to optimize this and bring the time down?

Abhi
  • I found a similar post. How about trying that? https://stackoverflow.com/questions/27407430/how-to-speed-up-process-of-loading-and-reading-json-files-in-python – akio.tanaka Jul 15 '20 at 01:46
  • Yup, will try UltraJSON. But not very optimistic. – Abhi Jul 15 '20 at 02:31
  • What platform are you on? Does this need to run on Windows? – tdelaney Jul 15 '20 at 14:19
  • Is the goal just to write the CSV or are you going to process the full DF first? – tdelaney Jul 15 '20 at 14:20
  • There are other, faster serialization formats such as Feather, Parquet, and HDF5. And depending on what you want to do with the data long term, a NoSQL solution like MongoDB or even good ole SQL is a good choice. Once you've imported these JSON files you have rich query capabilities, and if this grows with more JSON over time, just keep importing as more comes in. – tdelaney Jul 15 '20 at 16:24
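
As a minimal illustration of the Parquet suggestion in the comment above (the file name is hypothetical, and pandas needs pyarrow or fastparquet installed for this):

import pandas as pd

# a Parquet round-trip is typically much faster and smaller than CSV
df = pd.DataFrame({'Date': ['2020-07-15'], 'value': [1]})
df.to_parquet('records.parquet')
df_back = pd.read_parquet('records.parquet')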

2 Answers

2

One important issue comes from the dataframe appending, which is performed in O(n^2): for each newly processed JSON file, finalDf is entirely copied!

Here is a modified version running in O(n) time:

%%time
import json
from pathlib import Path

import pandas as pd

rootdir = '/path/foldername'
all_files = Path(rootdir).rglob('*.json')
allDf = []
for filename in all_files:
    with open(filename, 'r') as f:
        data = json.load(f)
        # 'A' and 'B' stand in for the real column names
        df = pd.json_normalize(data).drop(columns=['A']).rename(columns={'B': 'Date'})
        allDf.append(df)
# a single concat copies the data only once, instead of once per file
finalDf = pd.concat(allDf, ignore_index=True)

If this is not enough, the JSON parsing and the pandas post-processing could be executed in parallel using the multiprocessing module.
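
Here is a minimal sketch of that idea, assuming each worker parses one file and returns a DataFrame that the parent process concatenates ('A' and 'B' are the placeholder column names from the question):

import json
from multiprocessing import Pool
from pathlib import Path

import pandas as pd

def load_one(filename):
    # parse and normalize a single JSON file in a worker process
    with open(filename, 'r') as f:
        data = json.load(f)
    return pd.json_normalize(data).drop(columns=['A']).rename(columns={'B': 'Date'})

if __name__ == '__main__':
    rootdir = '/path/foldername'
    all_files = [str(p) for p in Path(rootdir).rglob('*.json')]
    with Pool() as pool:
        # a chunksize > 1 cuts inter-process communication overhead for many small files
        allDf = pool.map(load_one, all_files, chunksize=100)
    finalDf = pd.concat(allDf, ignore_index=True)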

Jérôme Richard
  • This certainly helps. Could you elaborate on the _json parsing and pandas post-processings could be executed in parallel using the multiprocessing module_ . Any implementations/code? – Abhi Jul 15 '20 at 12:42
1

If the goal is just to write the CSV, you can use multiprocessing to parallelize the read/deserialize/serialize steps and control the file writes with a lock. With a CSV you don't have to hold the whole thing in memory; just append each DF as it's generated. If you are using hard drives instead of an SSD, you may also get a boost if the CSV is on a different drive (not just a different partition).

import multiprocessing as mp
import json
import pandas as pd
from pathlib import Path
import os

def update_csv(args):
    lock, infile, outfile = args
    with open(infile) as f:
        data = json.load(f)
    # 'A' and 'B' stand in for the real column names
    df = pd.json_normalize(data).drop(columns=['A']).rename(columns={'B': 'Date'})
    # the lock serializes the appends so rows from different processes don't interleave
    with lock:
        with open(outfile, mode="a", newline="") as f:
            df.to_csv(f)

if __name__ == "__main__":
    rootdir ='/path/foldername'
    outfile = 'myoutput.csv'
    if os.path.exists(outfile):
        os.remove(outfile)
    all_files = [str(p) for p in Path(rootdir).rglob('*.json')]
    mgr = mp.Manager()
    lock = mgr.Lock()
    # pool sizing is a bit of a guess....
    with mp.Pool(mp.cpu_count()-1) as pool:
        result = pool.map(update_csv, [(lock, fn, outfile) for fn in all_files],
            chunksize=1)

Personally, I prefer to use a file-system lock file for this type of thing, but that's platform dependent and you may have problems on some file system types (like a mounted remote file system). multiprocessing.Manager synchronizes through a background server process - I'm not sure how efficient its Lock is, but it's good enough here; it will only be a minor fraction of the cost.
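
For reference, a lock-file variant might look like the sketch below. It assumes the third-party filelock package and is only meant to illustrate the idea; the Manager-based version above works fine without it.

import json

import pandas as pd
from filelock import FileLock  # third-party: pip install filelock

def update_csv(args):
    infile, outfile = args
    with open(infile) as f:
        data = json.load(f)
    # 'A' and 'B' stand in for the real column names
    df = pd.json_normalize(data).drop(columns=['A']).rename(columns={'B': 'Date'})
    # a sibling .lock file guards the CSV, so no lock object needs to be
    # passed between processes
    with FileLock(outfile + ".lock"):
        with open(outfile, mode="a", newline="") as f:
            df.to_csv(f)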

tdelaney
  • In the last line _result = pool.map.._ Passing this "args=[(lock, outfile, fn) for fn in all_files]" gives TypeError: map() got an unexpected keyword argument 'args'. Passing this "[(lock, outfile, fn) for fn in all_files]" gives RuntimeError: Lock objects should only be shared between processes through inheritance – Abhi Jul 15 '20 at 16:04
  • Okay, a few problems there.... I've posted an update. – tdelaney Jul 15 '20 at 16:14
  • Is this sequence _args=[(lock, outfile, fn)_ correct? Because I am getting the same error somehow. I even corrected the **if __name__ == "__main__":** and **exists** typos – Abhi Jul 15 '20 at 16:34
  • Okay, now it runs, at least with 0 JSON files to process. – tdelaney Jul 15 '20 at 16:39
  • As the number of files increases, this program is slowing down. Any alternatives to creating a CSV, maybe a txt file? (10 GB --> 1 GB CSV in 2 hrs) @tdelaney – Abhi Oct 20 '20 at 15:56
  • 1
    There are lots of other options, but it depends on how you want to consume the data later. You could put the JSON into a SQL db, a NoSQL db such as MongoDB or CouchDB, or a table in HDFS, or use Apache Arrow. These all have ways to query and filter data when you read it, so that you don't have to pull the entire dataset into memory. If you really want the entire dataset in memory, then formats like Parquet and Feather will be efficient for reading into pandas. – tdelaney Oct 20 '20 at 16:06
  • Estimating the data to grow by 20-30 times. Looking for an in-memory solution. For Parquet or Feather, the only change would be df.to_parquet/df.to_feather, right? Also, each time a JSON is read, its header ends up counted as a row, so the total record count is doubled. – Abhi Oct 20 '20 at 19:50
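
The doubled record count in the last comment comes from df.to_csv writing its column header on every append; one possible fix (not part of the original answer) is to write the header only when the output file does not exist yet, as in the sketch below. Parquet and Feather, by contrast, typically write a whole file at a time rather than appending, so df.to_parquet/df.to_feather would replace the append-to-one-CSV pattern (for example, one file per worker, concatenated at read time) rather than being a drop-in change.

import os

import pandas as pd

def append_df_to_csv(df, outfile):
    # emit the header only when the CSV is first created, so re-reading the
    # file does not treat repeated header lines as data rows
    write_header = not os.path.exists(outfile)
    df.to_csv(outfile, mode="a", header=write_header, index=False)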