
I want to read 22 files (stored on my hard disk), each with around 300,000 rows, into a single pandas data frame. My code was able to do it in 15-25 minutes. My initial thought is that I should make it faster by using more CPUs. (Correct me if I am wrong here, and if all CPUs can't read data from the same hard disk at the same time; however, in that case we can assume the data might be spread across different hard disks later on, so this exercise is still useful.)

I found a few posts like this and this and tried the code below.

import os
import pandas as pd
from multiprocessing import Pool

def read_psv(filename):
    'reads one row of a pipe-delimited file into a pandas dataframe'
    return pd.read_csv(filename,
                       delimiter='|',
                       skiprows=1,   # need this as the first row is junk
                       nrows=1,      # just one row for faster testing
                       encoding="ISO-8859-1",  # need this as well
                       low_memory=False)



files = os.listdir('.') #getting all files, will use glob later
df1 = pd.concat((read_psv(f) for f in files[0:6]), ignore_index=True, axis=0, sort=False) #takes less than 1 second

pool = Pool(processes=3)
df_list = pool.map(read_psv, files[0:6]) #takes forever
#df2 =  pd.concat(df_list, ignore_index=True) #cant reach this 

This takes forever (it runs for 30-60 minutes or more without finishing before I kill the process). I also went through a similar question to mine, but it was of no use.

EDIT: I am using Jupyter on Windows.

Gaurav Singhal
  • I understood why it keeps running forever. I was using this code on Windows, which requires `pool` to be defined inside an `if __name__ == '__main__':` clause; otherwise it creates a runtime error. Please see this for more details: https://stackoverflow.com/questions/20222534/python-multiprocessing-on-windows-if-name-main – Gaurav Singhal Nov 21 '18 at 08:50
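
A minimal sketch of that guard, assuming the code is saved as a standalone script and the pipe-delimited files match an illustrative *.psv pattern (the read options mirror the question's code):

import glob
import pandas as pd
from multiprocessing import Pool

def read_psv(filename):
    'reads one row of a pipe-delimited file into a pandas dataframe'
    return pd.read_csv(filename, delimiter='|', skiprows=1, nrows=1,
                       encoding="ISO-8859-1", low_memory=False)

if __name__ == '__main__':  # required on Windows: spawned children re-import this script
    files = glob.glob('*.psv')  # illustrative pattern, adjust to your files
    with Pool(processes=3) as pool:
        df_list = pool.map(read_psv, files)
    df2 = pd.concat(df_list, ignore_index=True, sort=False)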

2 Answers


Your task is I/O-bound: the bottleneck is the hard drive. The CPU only has to do a little work to parse each line of the CSV.

Disk reads are fastest when they are sequential. If you want to read a large file, it's best to let the disk seek to the beginning and then read all of its bytes sequentially.

If you have multiple large files on the same hard-drive and read from them using multiple processes, then the disk head will have to jump back and forth between them, where each jump takes up to 10 ms.

Multiprocessing can still make your code faster, but you will need to store your files on multiple disks, so each disk head can focus on reading one file.
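
A minimal sketch of that layout, assuming the files have been split across two hypothetical drives at D:\data and E:\data, with one worker process per disk so each disk is still read sequentially:

import glob
import pandas as pd
from multiprocessing import Pool

def read_psv(filename):
    'reads one pipe-delimited file into a pandas dataframe'
    return pd.read_csv(filename, delimiter='|', skiprows=1,
                       encoding="ISO-8859-1", low_memory=False)

def read_disk(pattern):
    'reads every file on one disk sequentially and returns the dataframes'
    return [read_psv(f) for f in sorted(glob.glob(pattern))]

if __name__ == '__main__':
    patterns = [r'D:\data\*.psv', r'E:\data\*.psv']  # hypothetical paths
    with Pool(processes=len(patterns)) as pool:      # one process per disk
        per_disk = pool.map(read_disk, patterns)
    df = pd.concat([frame for frames in per_disk for frame in frames],
                   ignore_index=True)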

Another alternative is to buy an SSD: seek time is much lower, around 0.1 ms, and throughput is around 5x higher.

fafl
  • Thanks for the answer. My hard disk is already an SSD. Also, in the future I might be doing this on multiple hard disks. Also, the code breaks even when I try to read just 6 files with one row each. – Gaurav Singhal Nov 21 '18 at 04:16

So the issue is not related to bad performance or getting stuck at I/O. The issue is related to Jupyter and Windows. On Windows we need to include a clause like `if __name__ == '__main__':` before initializing the Pool. For Jupyter, we need to save the worker function in a separate file and import it into the notebook. Jupyter is also problematic because it does not show the workers' error log by default. I got to know about the Windows issue when I ran the code in a Python shell, and about the Jupyter error when I ran the code in an IPython shell. The following posts helped me a lot.

For Jupyter

For Windows Issue
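
A minimal sketch of that workaround, assuming the worker is saved in a hypothetical module named readers.py next to the notebook:

# readers.py -- hypothetical module saved next to the notebook
import pandas as pd

def read_psv(filename):
    'reads one pipe-delimited file into a pandas dataframe'
    return pd.read_csv(filename, delimiter='|', skiprows=1,
                       encoding="ISO-8859-1", low_memory=False)

and in the notebook cell:

# notebook cell: the worker is imported by name, so the spawned child
# processes on Windows can find it when they re-import the module
import glob
import pandas as pd
from multiprocessing import Pool
from readers import read_psv

if __name__ == '__main__':
    files = glob.glob('*.psv')  # illustrative pattern
    with Pool(processes=3) as pool:
        frames = pool.map(read_psv, files)
    df = pd.concat(frames, ignore_index=True)

Importing the worker from a module matters because the spawned processes cannot pickle a function that only exists inside the notebook's interactive namespace.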

Gaurav Singhal