
I have a 64-bit, 4-core, 2.50 GHz system with 64 GB of RAM and about 13 GB of free memory. I am trying to read 24 CSV files with around 40 million rows using the code below:

import os
import pandas as pd

def test():
    test = pd.DataFrame()
    rootdir = '/XYZ/A'
    for subdir, dirs, files in os.walk(rootdir):
        for file in files:
            df = pd.read_csv(os.path.join(subdir, file), low_memory=False)
            test = pd.concat([test, df])
    return test

How can I optimize this to run faster without the kernel dying? Should I implement this in PySpark instead? Please let me know if I missed any detail.

  • First things first, don't call `pandas.concat()` in a loop. Add the DataFrames to a temporary data structure, like a list, and concatenate once at the end. If you're concerned with memory usage, why use `low_memory=False`? – AMC Mar 25 '20 at 18:33
  • Does this answer your question? ["Large data" work flows using pandas](https://stackoverflow.com/questions/14262433/large-data-work-flows-using-pandas) – AMC Mar 25 '20 at 18:35
  • https://stackoverflow.com/questions/25962114/how-to-read-a-6-gb-csv-file-with-pandas – AMC Mar 25 '20 at 18:35
  • @AMC Removed `low_memory=False`; where would that temp storage go in my function? – Abhi Mar 25 '20 at 18:39
  • _Add the DataFrames to a temporary data structure, like a list, and concatenate once at the end._ – AMC Mar 25 '20 at 18:41
  • `def test(): test = pd.DataFrame() dataframeList = [] rootdir ='/XYZ/A' for subdir, dirs, files in os.walk(rootdir): for file in files: df = pd.read_csv(os.path.join(subdir, file)) dataframeList.append(df) test = pd.concat([test, df]) return test` Any other changes/suggestions? @AMC – Abhi Mar 25 '20 at 20:09
  • Should I call `pandas.concat()` outside the inner loop, in the outer loop?? – Abhi Mar 25 '20 at 20:13
  • You should concatenate once you have all the data/DataFrames, essentially (see the sketch just below these comments). – AMC Mar 25 '20 at 20:15
  • @AMC So, I ran it again, just a 1 min difference. – Abhi Mar 25 '20 at 21:07
  • If you run into memory problems you may want to consider dask: https://dask.org/ – monkut Mar 26 '20 at 01:00
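
A minimal sketch of the approach suggested in the comments above (collect the DataFrames in a list and call `pandas.concat()` once at the end), assuming the same directory layout as the question; the function name read_all_csvs is just illustrative:

import os
import pandas as pd

def read_all_csvs(rootdir='/XYZ/A'):
    # gather each DataFrame in a plain Python list instead of
    # concatenating inside the loop
    frames = []
    for subdir, dirs, files in os.walk(rootdir):
        for file in files:
            frames.append(pd.read_csv(os.path.join(subdir, file)))
    # a single concat at the end avoids repeatedly copying the
    # ever-growing intermediate DataFrame
    return pd.concat(frames, ignore_index=True)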

1 Answer


Have a go at this. I used the pathlib module, since it offers more succinct and clearer code IMHO, and because you can take advantage of iterators and generator expressions:

import pandas as pd
from pathlib import Path

def test():
    rootdir = '/XYZ/A'
    # assumption is that they are all CSVs;
    # if not, you could just use rglob('*.*').
    # this will recursively search through the directory
    # and pull all files with the .csv extension
    # (or all files if you use '*.*', which might be
    # a bit more computationally intensive)
    all_files = Path(rootdir).rglob('*.csv')
    all_dfs = (pd.read_csv(f)
               # kindly test this part:
               # .stem gets the file name before '.csv'
               # and returns a string;
               # rsplit splits on '_' and [-1] takes the last piece
               .assign(Date=f.stem.rsplit('_')[-1])
               for f in all_files)
    # the generator expression delays the heavy computation
    # until the concat step here
    final_df = pd.concat(all_dfs, ignore_index=True)
    return final_df

Let us know how it goes; if it fails, I'll take it down so as not to confuse others.

sammywemmy
  • Time reduced from 40 min to 1 min! – Abhi Mar 26 '20 at 05:49
  • But I have an operation which takes the last 8 characters of each CSV file name and appends that date as a new column to the CSV that was read: `d = file.split('_')[-1].split('.')[0] d = datetime.datetime.strptime(d, "%d%m%Y").strftime("%Y%d%m") df['DATE'] = d` Can this be adjusted into the loop? @sammywemmy – Abhi Mar 26 '20 at 05:57
  • Hey, glad it worked for you. Can you share a list of, say, five filenames in your question, so I can look at it and see if pathlib's features can possibly help? – sammywemmy Mar 26 '20 at 06:02
  • All file names are basically in the format XX_XXX_26032020.csv, and I am currently extracting the date as a string (DDMMYYYY) and converting it into 20202603 (YYYYDDMM). – Abhi Mar 26 '20 at 06:11
  • 2 mins! @sammywemmy I just need to convert the date to YYYYDDMM. But, I don't think it should affect the run time too much. Thanks! – Abhi Mar 26 '20 at 19:31
  • My dataset just quadrupled. I was using ``` date_for_file = file.split('_')[-1].split('.')[0] date_for_file = datetime.datetime.strptime(date_for_file, "%d%m%Y").date()``` But since you suggested ```(pd.read_csv(f) .assign(Date = f.stem.rsplit('_')[-1]) for f in all_files)``` Can the date format change be incorporated in the same for loop? – Abhi Apr 24 '20 at 18:45
  • I'd forgotten about this, tbh... but yeah, I think you could do that in the read_csv part; there is an infer-date option in read_csv that might help with speed. – sammywemmy Apr 24 '20 at 18:51
  • So, I am pulling the date from the name of the file. If I use the infer option, won't I have to get rid of the stemming part? Sorry for bothering you with this, but your solution works at super speed and I don't want to change it much. – Abhi Apr 24 '20 at 19:07
  • After the stem.rsplit I get the date in DDMMYYYY; I'd like it in YYYYDDMM, ideally without a new line of code. – Abhi Apr 24 '20 at 19:10
  • Oh, sorry. I think you should run the code ```(pd.read_csv(f).assign(Date=f.stem.rsplit('_')[-1]) for f in all_files)``` as I suggested. Once you have your data in, you can convert the Date column to your desired format using pd.to_datetime. – sammywemmy Apr 24 '20 at 19:22
  • Tried ```final_df['date'] = pd.to_datetime(final_df['date']).strftime("%Y%d%m")```, but to_datetime converts it into a UNIX-style date, and the conversion to YYYYDDMM with strftime doesn't work. – Abhi Apr 27 '20 at 15:41
  • pd.to_datetime has a format option; try that instead of strftime (a sketch follows below). – sammywemmy Apr 27 '20 at 21:28
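
For reference, a small sketch of the date conversion discussed in the comments above, assuming the Date column produced by the answer's assign() holds DDMMYYYY strings such as '26032020'; parse with an explicit format, then reformat with .dt.strftime:

import pandas as pd

# toy stand-in for the DataFrame returned by test(); the 'Date' values
# are what f.stem.rsplit('_')[-1] would yield for XX_XXX_26032020.csv
final_df = pd.DataFrame({'Date': ['26032020', '27032020']})

# parse DDMMYYYY explicitly, then format as the desired YYYYDDMM string
final_df['Date'] = (pd.to_datetime(final_df['Date'], format='%d%m%Y')
                    .dt.strftime('%Y%d%m'))
print(final_df['Date'].tolist())  # ['20202603', '20202703']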