If the goal is just to write the CSV, you can use multiprocessing to parallelize the read/deserialize/serialize steps and guard the file writes with a lock. With a CSV you don't have to hold the whole thing in memory; just append each DataFrame as it's generated. If you are using hard drives instead of an SSD, you may also get a boost by putting the CSV on a different drive (not just a different partition).
import multiprocessing as mp
import json
import os
from pathlib import Path

import pandas as pd


def update_csv(args):
    lock, infile, outfile = args
    with open(infile) as f:
        data = json.load(f)
    # 'A' and 'B' are placeholder column names
    df = pd.json_normalize(data).drop(columns=['A']).rename(columns={'B': 'Date'})
    with lock:
        with open(outfile, mode="a", newline="") as f:
            # only write the header on the first append
            df.to_csv(f, header=f.tell() == 0, index=False)


if __name__ == "__main__":
    rootdir = '/path/foldername'
    outfile = 'myoutput.csv'
    if os.path.exists(outfile):
        os.remove(outfile)
    all_files = [str(p) for p in Path(rootdir).rglob('*.json')]
    mgr = mp.Manager()
    lock = mgr.Lock()
    # pool sizing is a bit of a guess....
    with mp.Pool(mp.cpu_count() - 1) as pool:
        result = pool.map(update_csv, [(lock, fn, outfile) for fn in all_files],
                          chunksize=1)
Personally, I prefer to use a file system lock file for this type of thing, but that's platform dependent and you may have problems on some file system types (like a mounted remote file system). multiprocessing.Manager does its synchronization through a background process - I'm not sure how efficient its Lock is, but it's good enough here; it will only be a minor fraction of the total cost.
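If you do want to try the lock-file route in a portable way, one option is the third-party filelock package. This is just a sketch, not part of the solution above: it shows the same worker with the Manager lock swapped for a file lock keyed on the output path, so nothing needs to be passed through pool.map for synchronization.

    # Sketch only: requires the third-party `filelock` package (pip install filelock).
    import json

    import pandas as pd
    from filelock import FileLock


    def update_csv_filelock(args):
        infile, outfile = args
        with open(infile) as f:
            data = json.load(f)
        # same placeholder column names as above
        df = pd.json_normalize(data).drop(columns=['A']).rename(columns={'B': 'Date'})
        # each worker creates its own FileLock on the same .lock path;
        # acquiring it blocks until no other process holds it
        with FileLock(outfile + ".lock"):
            with open(outfile, mode="a", newline="") as f:
                df.to_csv(f, header=f.tell() == 0, index=False)

Because the lock is identified by a path on disk rather than an object inherited from the parent, it also works across independently launched processes, not just within one Pool.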