
This is a variation of an earlier question I posted: "Merging 1300 data frames into a single frame becomes really slow".

This time I am trying to merge the frames column-wise, not stack them below one another.

I have 1300 csv files in a directory.

Each file has a date in the first column, followed by daily data for the last 20-30 years, which spans another 8 columns.

So Data1.csv looks like this:

Date source1 source2 source3 source4 source5 source6 source7 source8

I have 1300 uniquely named files, each with uniquely named columns.

I am trying to merge all of these into one dataframe with pandas column-wise: not appending below, but merging left and right, like this:

import os
import pandas as pd

frame = pd.read_csv(file_path, index_col=0)  # file_path points at the first file

for filename in os.listdir(filepath):  # filepath has the rest of the files
    file_path = os.path.join(filepath, filename)
    df = pd.read_csv(file_path, index_col=0)
    df = df.groupby(['Date']).first()   # collapse duplicate dates within a file
    df = df.add_prefix(f"{filename}-")  # make column names unique per file
    frame = pd.merge(frame, df, how='outer', left_index=True, right_index=True)

But around the 300th file my code slows down dramatically.

I am merging these files column-wise. My goal is a massive dataframe that is 1+(1300x8) columns wide by the number of daily dates in 30 years.
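For scale, a rough back-of-the-envelope estimate of the final frame (assuming one row per day):

cols = 1 + 1300 * 8           # 10,401 columns including the date column
rows = 30 * 365               # ~10,950 daily dates
print(cols * rows * 8 / 1e9)  # ~0.9 GB of float64 values, before any merge overhead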

Is there a way to speed this up before my computer runs out of memory?


1 Answer


The reason your code slows down is the same issue as in your linked question: quadratic copying. On each merge you copy the entire dataframe accumulated so far plus the new data, so the total work grows with the square of the number of files. The solution is to store the individual dataframes in a list and concatenate them once, after all the files have been read.

frames = []

for filename in os.listdir(filepath):  # filepath now holds all of the files
    file_path = os.path.join(filepath, filename)
    df = (pd.read_csv(file_path, index_col=0)
          .groupby(['Date']).first()
          .add_prefix(f"{filename}-"))
    frames.append(df)

frame = pd.concat(frames, axis=1)
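To get a concrete feel for the quadratic vs. linear behaviour, here is a rough timing sketch on synthetic data (the sizes are invented for illustration, not taken from the question):

import time
import numpy as np
import pandas as pd

# 200 small frames sharing a daily date index (synthetic data)
dates = pd.date_range('1990-01-01', periods=5000, freq='D')
dfs = [pd.DataFrame(np.random.rand(len(dates), 8), index=dates).add_prefix(f"{i}-")
       for i in range(200)]

# Quadratic: every merge copies everything accumulated so far
t0 = time.perf_counter()
out = dfs[0]
for df in dfs[1:]:
    out = pd.merge(out, df, how='outer', left_index=True, right_index=True)
print('iterative merge:', time.perf_counter() - t0)

# Linear: each frame is copied once, in a single concat
t0 = time.perf_counter()
out = pd.concat(dfs, axis=1)
print('single concat:  ', time.perf_counter() - t0)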

To illustrate the concept with some sample data:

df1 = pd.DataFrame(
    {'Date': ['2020-01-01', '2020-01-02', '2020-01-02', '2020-01-03'], 
     'A': [4, 5, 55, 6], 
     'B': [7, 8, 85, 9]}
).set_index('Date')
df2 = pd.DataFrame(
    {'Date': ['2020-01-02', '2020-01-03', '2020-01-04'], 
     'A': [40, 50, 60], 
     'C': [70, 80, 90]}
).set_index('Date')

frames = []

for n, df in enumerate([df1, df2]):  # stand-ins for the csv files
    df = (df.groupby(level='Date').first()
          .add_prefix(f"{n}-"))
    frames.append(df)

frame = pd.concat(frames, axis=1)

>>> frame
            0-A  0-B   1-A   1-C
Date
2020-01-01  4.0  7.0   NaN   NaN
2020-01-02  5.0  8.0  40.0  70.0
2020-01-03  6.0  9.0  50.0  80.0
2020-01-04  NaN  NaN  60.0  90.0
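One caveat: pd.concat aligns on the index labels exactly as read, so if the csv files store dates in different string formats, the same day can land on different rows. A small tweak to the loop above (assuming the first column holds parseable dates) is to read with

df = pd.read_csv(file_path, index_col=0, parse_dates=[0])  # real datetimes, not strings

and then frame.sort_index() after the concat keeps the dates in chronological order.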
