
I have two sets of JSON files, b and c. The number of files in each set is normally between 500 and 1000. Right now I am reading them separately. Can I read both sets at the same time using multi-threading? I have enough memory and processors.

import json
import pandas as pd

yc = ...  # number of c files
yb = ...  # number of b files

c_output_transaction_list =[]
for num in range(yc):
    c_json_file='./output/d_c_'+str(num)+'.json'
    print(c_json_file)
    c_transaction_list = json.load(open(c_json_file))['data']['transaction_list']
    c_output_transaction_list.extend(c_transaction_list)
df_res_c= pd.DataFrame(c_output_transaction_list) 


b_output_transaction_list =[]
for num in range(yb):
    b_json_file='./output/d_b_'+str(num)+'.json'
    print(b_json_file)
    b_transaction_list = json.load(open(b_json_file))['data']['transaction_list']
    b_output_transaction_list.extend(b_transaction_list)
df_res_b= pd.DataFrame(b_output_transaction_list) 
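
For context, something along these lines is what I mean by reading them at the same time; a rough sketch with threads (the read_one helper and max_workers=8 are just placeholders I made up, not part of my current code):

import json
from concurrent.futures import ThreadPoolExecutor

import pandas as pd

# helper that reads one JSON file and returns its transaction list
def read_one(path):
    with open(path) as f:
        return json.load(f)['data']['transaction_list']

# build the list of c file paths up front
c_files = ['./output/d_c_' + str(num) + '.json' for num in range(yc)]

c_output_transaction_list = []
# the threads overlap the file reads instead of doing them one after another
with ThreadPoolExecutor(max_workers=8) as executor:
    for transactions in executor.map(read_one, c_files):
        c_output_transaction_list.extend(transactions)

df_res_c = pd.DataFrame(c_output_transaction_list)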
Manu Mohan
    Adding parallelism to I/O bound processing will only make it slower. – tripleee Apr 30 '21 at 07:31
  • Would this maybe answer your question? https://stackoverflow.com/a/4047840/15744261 - note the comments regarding performance on Linux vs. Windows. – Aelarion Apr 30 '21 at 18:19
  • Does this answer your question? [Parallel loading of Input Files in Pandas Dataframe](https://stackoverflow.com/questions/54309599/parallel-loading-of-input-files-in-pandas-dataframe) – Deepak May 01 '21 at 05:49

1 Answer


I use this method to read hundreds of files in parallel into a final dataframe. Without having your data, you'll have to verify it does what you want; reading the multiprocessing docs will help. I use the same code on Linux (an AWS EC2 instance reading S3 files) and on Windows reading the same S3 files, and I see a big time saving doing this.

import os
import json
import pandas as pd
from multiprocessing import Pool

# set the number of processes yourself or just take cpu_count() from the os module.
# Playing around with this does make a difference; for me the max isn't always the
# fastest overall time.
num_proc = os.cpu_count()

# define the function that creates a dataframe from a single file
# note: unlike your version, this builds a dataframe per file instead of one big list at the end
def json_parse(c_json_file):
    with open(c_json_file) as f:
        c_transaction_list = json.load(f)['data']['transaction_list']
    return pd.DataFrame(c_transaction_list)

# this is the multiprocessing function that feeds the file names to the parsing function
# if you don't pass num_proc it defaults to 4
def json_multiprocess(fn_list, num_proc=4):
    with Pool(num_proc) as pool:
        # map is enough here since each call takes a single file name; if you need to
        # pass more than the file name, starmap handles zip()-ed argument tuples very well.
        # the 15 is the chunksize handed to each worker.
        r = pool.map(json_parse, fn_list, 15)
        pool.close()
        pool.join()
    return r

# build your file list first
yc = ...  # number of c files, as in your code
flist = []
for num in range(yc):
    c_json_file = './output/d_c_' + str(num) + '.json'
    flist.append(c_json_file)

# get a list of your intermediate dataframes
dfs = json_multiprocess(flist, num_proc=num_proc)
# concat the dataframes into one
df_res_c = pd.concat(dfs)

Then do the same for your next set of files (a sketch follows below). Use the example in Aelarion's comment to help structure the file.
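
For completeness, here's a minimal sketch of that second pass over the b files, reusing json_parse and json_multiprocess from above (yb is your count of b files, left as a placeholder):

# build the b file list the same way, then reuse the same pool helper
yb = ...  # number of b files, as in your code
b_flist = ['./output/d_b_' + str(num) + '.json' for num in range(yb)]

dfs_b = json_multiprocess(b_flist, num_proc=num_proc)
df_res_b = pd.concat(dfs_b)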

Jonathan Leon