
I have 1300 csv files in a directory.

Each file has a date in the first column, followed by daily data for the last 20-30 years which spans another 8 columns.

So like this, Data1.csv

Date source1 source2 source3 source4 source5 source6 source7 source8

I have 1300 uniquely named files.

I am trying to merge all of these into one dataframe using pandas like this

import os
import pandas as pd

frame = pd.DataFrame()

length = len(os.listdir(filepath))
for filename in os.listdir(filepath):
    file_path = os.path.join(filepath, filename)
    print(length, end=" ")
    df = pd.read_csv(file_path, index_col=0)
    # stack the 8 source columns into a single 'Data' column, tagging each with its file and source name
    df = pd.concat([df[[col]].assign(Source=f'{filename[:-4]}-{col}').rename(columns={col: 'Data'}) for col in df])
    frame = frame.append(df)
    length -= 1

But by around the 300th file the frame already has around 12 million rows, and my code really slows down...

Is there a way to speed this up before my computer runs out of memory?

My goal is actually to have one massive dataframe with 1 + (1300 x 8) columns and one row per date across the 30 years.

anarchy
  • (a) Do you need 12M+ rows in memory at once? (b) Have you looked at [Dask](https://dask.org)? (c) May be time to look for a database solution? – Code Different Oct 16 '20 at 12:12
  • I actually want to combine all the frames column-wise to have one massive frame where only the dates are the index, so it would look like this: Date data1-source1 data1-source2... data2-source1 data2-source2... etc. But when I try to merge it column-wise I also run out of space (see the column-wise sketch after these comments) – anarchy Oct 16 '20 at 12:14
  • @CodeDifferent Can you recommend any database solutions that will let me keep one massive column-wise database? I want a time series with dates as the index, meaning 30 years of daily dates, and I want to be able to add new columns of data, as many as I want. – anarchy Oct 17 '20 at 21:53
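A minimal sketch of the column-wise merge described in the comments, assuming every file shares the same daily date index and reusing the filepath directory from the question; axis=1 aligns the files on their dates instead of stacking them as rows:

import os
import pandas as pd

parts = []
for filename in os.listdir(filepath):
    # parse the first column as dates so every file aligns on the same index
    df = pd.read_csv(os.path.join(filepath, filename), index_col=0, parse_dates=True)
    # prefix every column with the file name, e.g. Data1-source1
    df.columns = [f'{filename[:-4]}-{col}' for col in df.columns]
    parts.append(df)

# one column-wise concat; the result has 1300 x 8 data columns indexed by date
wide = pd.concat(parts, axis=1)

This still requires the full wide frame to fit in memory; the answer below (and the Dask note after it) addresses the memory side.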

1 Answer


The reason your loop slows down is that at each .append(), the dataframe has to create a copy of itself in order to allocate more memory, as described here.

If your memory can fit it all, you could first fill a list of fixed size (1300) with all the dataframes, and then call df = pd.concat(list_of_dataframes) once at the end, which would probably avoid the issue you are having right now. Your code could be adjusted as such:

import os
import pandas as pd

lst = [None for _ in range(1300)]  # pre-allocate one slot per file

for i, filename in enumerate(os.listdir(filepath)):
    file_path = os.path.join(filepath, filename)
    df = pd.read_csv(file_path, index_col=0)
    # same reshaping as in the question: one 'Data' column plus a 'Source' label
    df = pd.concat([df[[col]].assign(Source=f'{filename[:-4]}-{col}').rename(columns={col: 'Data'}) for col in df])
    lst[i] = df

frame = pd.concat(lst)  # a single concat instead of 1300 appends
Marcus
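If even the single concatenated result cannot fit in memory, the Dask route suggested in the comments is another option. A rough sketch, assuming Dask and a parquet engine (pyarrow or fastparquet) are installed, that all 1300 files share the same header, and with 'merged_parquet' as a placeholder output directory; include_path_column tags each row with the file it came from so the source can be recovered later:

import os
import dask.dataframe as dd

# lazily treat all 1300 CSVs as one dataframe without loading them into memory
ddf = dd.read_csv(os.path.join(filepath, '*.csv'), include_path_column='source_file')

# nothing is computed until a write (or .compute()) is triggered,
# so the merged data can be streamed to disk instead of held in RAM
ddf.to_parquet('merged_parquet')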