
I have a 64-bit, 4-core, 2.50 GHz system with 64 GB of RAM and about 13 GB of free memory. I am trying to read 24 CSV files with around 40 million rows using the code below:

import os
import pandas as pd

def test():
    test = pd.DataFrame()
    rootdir = '/XYZ/A'
    for subdir, dirs, files in os.walk(rootdir):
        for file in files:
            df = pd.read_csv(os.path.join(subdir, file), low_memory=False)
            test = pd.concat([test, df])
    return test

How can I optimize this to run faster without the kernel dying? Should I implement this in PySpark instead? Please let me know if I missed any detail.

  • First things first, don't call `pandas.concat()` in a loop. Add the DataFrames to a temporary data structure, like a list, and concatenate once at the end. If you're concerned with memory usage, why use `low_memory=False`? – AMC Mar 25 '20 at 18:33
  • Does this answer your question? ["Large data" work flows using pandas](https://stackoverflow.com/questions/14262433/large-data-work-flows-using-pandas) – AMC Mar 25 '20 at 18:35
  • https://stackoverflow.com/questions/25962114/how-to-read-a-6-gb-csv-file-with-pandas – AMC Mar 25 '20 at 18:35
  • @AMC Removed `low_memory=False`; where would that temp storage go in my function? – Abhi Mar 25 '20 at 18:39
  • _Add the DataFrames to a temporary data structure, like a list, and concatenate once at the end._ – AMC Mar 25 '20 at 18:41
  • `def test(): test = pd.DataFrame() dataframeList = [] rootdir ='/XYZ/A' for subdir, dirs, files in os.walk(rootdir): for file in files: df = pd.read_csv(os.path.join(subdir, file)) dataframeList.append(df) test = pd.concat([test, df]) return test` Any other changes/suggestions? @AMC – Abhi Mar 25 '20 at 20:09
  • Should I call `pandas.concat()` outside the inner loop, in the outer loop?? – Abhi Mar 25 '20 at 20:13
  • You should concatenate once you have all the data/DataFrames, essentially (see the sketch just below these comments). – AMC Mar 25 '20 at 20:15
  • @AMC So, I ran it again, just a 1 min difference. – Abhi Mar 25 '20 at 21:07
  • If you run into memory problems you may want to consider dask: https://dask.org/ – monkut Mar 26 '20 at 01:00
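
A minimal sketch of the approach suggested in the comments above (collect the DataFrames in a list and call `pandas.concat()` once at the end), assuming the same directory layout as the question; the function name read_all_csvs is just illustrative:

import os
import pandas as pd

def read_all_csvs(rootdir='/XYZ/A'):
    # gather each DataFrame in a plain Python list instead of
    # concatenating inside the loop
    frames = []
    for subdir, dirs, files in os.walk(rootdir):
        for file in files:
            frames.append(pd.read_csv(os.path.join(subdir, file)))
    # a single concat at the end avoids repeatedly copying the
    # ever-growing intermediate DataFrame
    return pd.concat(frames, ignore_index=True)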

1 Answer


Have a go at this. I used the pathlib module, since it offers more succinct and clearer code IMHO, and because you can take advantage of iterators and generator expressions:

import pandas as pd
from pathlib import Path

def test():
    rootdir = '/XYZ/A'
    # assumption is that they are all CSVs;
    # if not, you could just use rglob('*.*').
    # this will recursively search through the directory
    # and pull all files with the .csv extension
    # (or all files if you use '*.*', which might be
    # a bit more computationally intensive)
    all_files = Path(rootdir).rglob('*.csv')
    all_dfs = (pd.read_csv(f)
               # kindly test this part:
               # .stem gets the file name before '.csv'
               # and returns a string;
               # rsplit splits on '_' and [-1] takes the last piece
               .assign(Date=f.stem.rsplit('_')[-1])
               for f in all_files)
    # the generator expression delays the heavy computation
    # until the concat step here
    final_df = pd.concat(all_dfs, ignore_index=True)
    return final_df

Let us know how it goes; if it fails, I'll take it down so as not to confuse others.

sammywemmy
  • Time reduced from 40 min to 1 min! – Abhi Mar 26 '20 at 05:49
  • But I have an operation which takes the last 8 characters of each CSV file name and appends that date as a new column to the CSV that was read: `d = file.split('_')[-1].split('.')[0] d = datetime.datetime.strptime(d, "%d%m%Y").strftime("%Y%d%m") df['DATE'] = d` Can this be adjusted into the loop? @sammywemmy – Abhi Mar 26 '20 at 05:57
  • Hey, glad it worked for you. Can you share a list of, say, five filenames in your question, so I can look at it and see if pathlib's features can possibly help? – sammywemmy Mar 26 '20 at 06:02
  • All file names are basically in the format XX_XXX_26032020.csv, and I am currently extracting the date as a string (DDMMYYYY) and converting it into 20202603 (YYYYDDMM). – Abhi Mar 26 '20 at 06:11
  • 2 mins! @sammywemmy I just need to convert the date to YYYYDDMM. But, I don't think it should affect the run time too much. Thanks! – Abhi Mar 26 '20 at 19:31
  • My dataset just quadrupled. I was using ``` date_for_file = file.split('_')[-1].split('.')[0] date_for_file = datetime.datetime.strptime(date_for_file, "%d%m%Y").date()``` But since you suggested ```(pd.read_csv(f) .assign(Date = f.stem.rsplit('_')[-1]) for f in all_files)``` Can the date format change be incorporated in the same for loop? – Abhi Apr 24 '20 at 18:45
  • I'd forgotten about this, tbh... but yeah, I think you could do that in the read_csv part; there is an infer-date option in read_csv that might help with speed. – sammywemmy Apr 24 '20 at 18:51
  • So, I am pulling the date from the name of the file. If I use the infer option, won't I have to get rid of the stemming part? Sorry for bothering you with this, but your solution works at super speed and I don't want to change it much. – Abhi Apr 24 '20 at 19:07
  • After the stem.rsplit I get the date in DDMMYYYY; I'd like it in YYYYDDMM, ideally without a new line of code. – Abhi Apr 24 '20 at 19:10
  • Oh, sorry. I think you should run the code ```(pd.read_csv(f).assign(Date=f.stem.rsplit('_')[-1]) for f in all_files)``` as I suggested. Once you have your data in, you can convert the Date column to your desired format using pd.to_datetime. – sammywemmy Apr 24 '20 at 19:22
  • Tried ```final_df['date'] = pd.to_datetime(final_df['date']).strftime("%Y%d%m")```, but to_datetime converts it into a UNIX-style date, and the conversion to YYYYDDMM with strftime doesn't work. – Abhi Apr 27 '20 at 15:41
  • pd.to_datetime has a format option; try that instead of strftime (a sketch follows below). – sammywemmy Apr 27 '20 at 21:28
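
For reference, a small sketch of the date conversion discussed in the comments above, assuming the Date column produced by the answer's assign() holds DDMMYYYY strings such as '26032020'; parse with an explicit format, then reformat with .dt.strftime:

import pandas as pd

# toy stand-in for the DataFrame returned by test(); the 'Date' values
# are what f.stem.rsplit('_')[-1] would yield for XX_XXX_26032020.csv
final_df = pd.DataFrame({'Date': ['26032020', '27032020']})

# parse DDMMYYYY explicitly, then format as the desired YYYYDDMM string
final_df['Date'] = (pd.to_datetime(final_df['Date'], format='%d%m%Y')
                    .dt.strftime('%Y%d%m'))
print(final_df['Date'].tolist())  # ['20202603', '20202703']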