I want to read 22 files (stored on my hard disk), with around 300,000 rows each, into a single pandas DataFrame. My current code takes 15-25 minutes to do this. My initial thought is that I should use more CPUs to make it faster. (Correct me if I am wrong here; and even if all CPUs can't read from the same hard disk at the same time, the data may sit on different hard disks later on, so this exercise is still useful.)
I found a few posts like this and this and tried the code below.
import os
import pandas as pd
from multiprocessing import Pool

def read_psv(filename):
    'reads one row of a pipe-delimited file into a pandas dataframe'
    return pd.read_csv(filename,
                       delimiter='|',
                       skiprows=1,             # need this as the first row is junk
                       nrows=1,                # just one row for faster testing
                       encoding='ISO-8859-1',  # need this as well
                       low_memory=False)
files = os.listdir('.')  # getting all files, will use glob later

df1 = pd.concat((read_psv(f) for f in files[0:6]), ignore_index=True, axis=0, sort=False)  # takes less than 1 second

pool = Pool(processes=3)
df_list = pool.map(read_psv, files[0:6])  # takes forever
# df2 = pd.concat(df_list, ignore_index=True)  # can't reach this
The parallel version runs forever (I killed the process after 30-60 minutes without it finishing). I also went through a similar question to mine, but it was of no use.
EDIT: I am using Jupyter on Windows.
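
For reference, this is a sketch of how I would run the same logic as a standalone script outside Jupyter, with the if __name__ == '__main__': guard that multiprocessing needs on Windows. The '*.txt' glob pattern is just a placeholder for my real file names, and this is the structure I plan to test, not code I have verified:

# read_all.py -- sketch of the same workflow as a standalone script
import glob
import pandas as pd
from multiprocessing import Pool

def read_psv(filename):
    'reads one row of a pipe-delimited file into a pandas dataframe'
    return pd.read_csv(filename,
                       delimiter='|',
                       skiprows=1,
                       nrows=1,
                       encoding='ISO-8859-1',
                       low_memory=False)

if __name__ == '__main__':          # required for multiprocessing on Windows
    files = glob.glob('*.txt')      # placeholder pattern for my real files
    with Pool(processes=3) as pool:
        df_list = pool.map(read_psv, files[0:6])
    df2 = pd.concat(df_list, ignore_index=True)
    print(df2.shape)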