
I have two sets of JSON files, b and c. The number of files in each set is normally between 500 and 1000. Right now I am reading them separately. Can I read both sets at the same time using multi-threading? I have enough memory and processors.

import json
import pandas as pd

yc = ...  # number of c files
yb = ...  # number of b files

c_output_transaction_list =[]
for num in range(yc):
    c_json_file='./output/d_c_'+str(num)+'.json'
    print(c_json_file)
    c_transaction_list = json.load(open(c_json_file))['data']['transaction_list']
    c_output_transaction_list.extend(c_transaction_list)
df_res_c= pd.DataFrame(c_output_transaction_list) 


b_output_transaction_list =[]
for num in range(yb):
    b_json_file='./output/d_b_'+str(num)+'.json'
    print(b_json_file)
    b_transaction_list = json.load(open(b_json_file))['data']['transaction_list']
    b_output_transaction_list.extend(b_transaction_list)
df_res_b= pd.DataFrame(b_output_transaction_list) 
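
For context, something along these lines is what I mean by reading them at the same time; a rough sketch with threads (the read_one helper and max_workers=8 are just placeholders I made up, not part of my current code):

import json
from concurrent.futures import ThreadPoolExecutor

import pandas as pd

# helper that reads one JSON file and returns its transaction list
def read_one(path):
    with open(path) as f:
        return json.load(f)['data']['transaction_list']

# build the list of c file paths up front
c_files = ['./output/d_c_' + str(num) + '.json' for num in range(yc)]

c_output_transaction_list = []
# the threads overlap the file reads instead of doing them one after another
with ThreadPoolExecutor(max_workers=8) as executor:
    for transactions in executor.map(read_one, c_files):
        c_output_transaction_list.extend(transactions)

df_res_c = pd.DataFrame(c_output_transaction_list)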
Manu Mohan
    Adding parallelism to I/O bound processing will only make it slower. – tripleee Apr 30 '21 at 07:31
  • Would this maybe answer your question? https://stackoverflow.com/a/4047840/15744261 - note the comments regarding performance on Linux vs. Windows. – Aelarion Apr 30 '21 at 18:19
  • Does this answer your question? [Parallel loading of Input Files in Pandas Dataframe](https://stackoverflow.com/questions/54309599/parallel-loading-of-input-files-in-pandas-dataframe) – Deepak May 01 '21 at 05:49

1 Answer


I use this method to read hundreds of files in parallel into a final dataframe. Without having your data, you'll have to verify it does what you want; reading the multiprocessing docs will help. I use the same code on Linux (an AWS EC2 instance reading S3 files) and on Windows reading the same S3 files, and I see a big time saving doing this.

import os
import json
import pandas as pd
from multiprocessing import Pool

# set the number of processes yourself or just take cpu_count() from the os module.
# Playing around with this does make a difference; for me the max isn't always the
# fastest overall time.
num_proc = os.cpu_count()

# define the function that creates a dataframe from a single file
# note: unlike your version, this builds a dataframe per file instead of one big list at the end
def json_parse(c_json_file):
    with open(c_json_file) as f:
        c_transaction_list = json.load(f)['data']['transaction_list']
    return pd.DataFrame(c_transaction_list)

# this is the multiprocessing function that feeds the file names to the parsing function
# if you don't pass num_proc it defaults to 4
def json_multiprocess(fn_list, num_proc=4):
    with Pool(num_proc) as pool:
        # map is enough here since each call takes a single file name; if you need to
        # pass more than the file name, starmap handles zip()-ed argument tuples very well.
        # the 15 is the chunksize handed to each worker.
        r = pool.map(json_parse, fn_list, 15)
        pool.close()
        pool.join()
    return r

# build your file list first
yc = ...  # number of c files, as in your code
flist = []
for num in range(yc):
    c_json_file = './output/d_c_' + str(num) + '.json'
    flist.append(c_json_file)

# get a list of your intermediate dataframes
dfs = json_multiprocess(flist, num_proc=num_proc)
# concat the dataframes into one
df_res_c = pd.concat(dfs)

Then do the same for your next set of files (a sketch follows below). Use the example in Aelarion's comment to help structure the file.
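
For completeness, here's a minimal sketch of that second pass over the b files, reusing json_parse and json_multiprocess from above (yb is your count of b files, left as a placeholder):

# build the b file list the same way, then reuse the same pool helper
yb = ...  # number of b files, as in your code
b_flist = ['./output/d_b_' + str(num) + '.json' for num in range(yb)]

dfs_b = json_multiprocess(b_flist, num_proc=num_proc)
df_res_b = pd.concat(dfs_b)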

Jonathan Leon