
I know how to merge CSVs by converting them into dataframes, combining the dfs, and outputting the .csv. I also know how to use glob to grab all of the CSVs in a directory. (See below.)

But I don't know how to traverse the directory to get the CSVs and load the resulting dataframes into `data_frames = [df1, df2, df3]`. I suppose I need to somehow get each df out of the for loop and into the `data_frames` list on each iteration.

import csv
import datetime
import time
import pandas as pd
import glob
from functools import reduce


path = "/home/reallymemorable/Documents/scripts/*dateFormatted.csv"

df1 = pd.read_table('file1.csv', sep=',')
df2 = pd.read_table('file2.csv', sep=',')
df3 = pd.read_table('file3.csv', sep=',')

data_frames = [df1, df2, df3]

df_merged = reduce(lambda left, right: pd.merge(left, right, on=['DATE'], how='outer'), data_frames).fillna('NULL')

df_merged.to_csv('merged.csv', sep=',', na_rep='.', index=False)


for fname in glob.glob(path):
    print(fname)

EDIT:

Using a suggestion below, I implemented this:

path = "/home/reallymemorable/Documents/scripts/"
files = [os.path.join(path, file) for file in os.listdir(path)]
df = pd.concat((pd.read_csv(f) for f in files if f.endswith('dateFormatted.csv')), ignore_index=True).reset_index()
print(df)
df.to_csv('TEST.csv', index=False)

But I am confused, because print(df) shows the dataframes stacked on top of each other rather than joined side by side:

reallymemorable@gort:~/Documents/scripts$ python3 3-mai_ccxt_concatCrypto.py 
     index   Timestamp  BTC/USDT_Open  BTC/USDT_High  BTC/USDT_Low  ...  ETH/USDT_Open  ETH/USDT_High  ETH/USDT_Low  ETH/USDT_Close  ETH/USDT_Volume
0        0  2019-01-30        3410.04        3478.00       3387.10  ...            NaN            NaN           NaN             NaN              NaN
1        1  2019-01-31        3457.50        3489.20       3418.80  ...            NaN            NaN           NaN             NaN              NaN
2        2  2019-02-01        3434.10        3488.00       3401.20  ...            NaN            NaN           NaN             NaN              NaN
3        3  2019-02-02        3462.20        3526.40       3440.29  ...            NaN            NaN           NaN             NaN              NaN
4        4  2019-02-03        3504.06        3511.09       3426.00  ...            NaN            NaN           NaN             NaN              NaN
..     ...         ...            ...            ...           ...  ...            ...            ...           ...             ...              ...
995    995  2020-06-08            NaN            NaN           NaN  ...         244.57         247.70        240.59          246.40     383116.54719
996    996  2020-06-09            NaN            NaN           NaN  ...         246.37         249.82        238.00          243.80     420327.17133
997    997  2020-06-10            NaN            NaN           NaN  ...         243.79         250.28        242.00          247.78     431285.61715
998    998  2020-06-11            NaN            NaN           NaN  ...         247.78         250.09        226.20          230.51     816456.89161
999    999  2020-06-12            NaN            NaN           NaN  ...         230.46         239.38        228.19          236.76     415920.13123

[1000 rows x 12 columns]

  • Does this answer your question? [Pandas Merging 101](https://stackoverflow.com/questions/53645882/pandas-merging-101) – wwnde Jun 12 '20 at 04:09
  • I looked through that post but I did not see a solution to what I'm looking for. Maybe I missed it though. – reallymemorable Jun 12 '20 at 04:10
  • @reallymemorable - I think the reason why the dataframes are being stacked is because you are using pd.concat, rather than pd.merge. – BGG16 Feb 18 '22 at 06:16

2 Answers


If I'm understanding what you're asking, here's a refactoring which reads every matching dataframe in the directory into a list so they can be merged.

# Removed unused imports
import pandas as pd
import glob


data_frames = []

for fname in glob.glob("/home/reallymemorable/Documents/scripts/*dateFormatted.csv"):
    # Read each matching file rather than a hard-coded 'file1.csv'
    data_frames.append(pd.read_table(fname, sep=','))
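
From there you can merge the collected frames exactly as in your original snippet. A minimal sketch, assuming every file shares the DATE column you merge on:

from functools import reduce

# Outer-merge every collected frame on the shared DATE column
df_merged = reduce(lambda left, right: pd.merge(left, right, on=['DATE'], how='outer'), data_frames).fillna('NULL')
df_merged.to_csv('merged.csv', index=False)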
– tripleee

I prefer using os, but you can do the same with glob if you like:

Note: here path is the folder where the CSV files are located.

path = "/home/reallymemorable/Documents/scripts/"
files = [os.path.join(path, file) for file in os.listdir(path)]
df = pd.concat((pd.read_csv(f) for f in files if f.endswith('csv')), ignore_index=True).reset_index()

EDIT: If there are other (non-CSV) files present in the folder, you may want to add an if-statement inside the pd.concat call, as I've done in the code above.

– NotAName
  • This outputs a `df` with the column headers correctly joined, but only the first file's data is in the subsequent rows; the data from the other files is not present. – reallymemorable Jun 12 '20 at 20:06
  • @reallymemorable, this code is meant to join CSVs that all have the same columns. If the columns are different, you end up with lots of NaNs, as you showed in your updated post. Depending on what your data is like and your exact intention, you might want to use `merge`, `append` or `join` in place of `concat` (see the sketch below). – NotAName Jun 13 '20 at 01:36
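
For instance, a minimal sketch that merges the per-file frames side by side on the shared Timestamp column (the column visible in your printed output) instead of stacking them with concat:

import glob
from functools import reduce

import pandas as pd

# Read each matching CSV and outer-merge them column-wise on Timestamp
frames = [pd.read_csv(f) for f in glob.glob("/home/reallymemorable/Documents/scripts/*dateFormatted.csv")]
df_joined = reduce(lambda left, right: pd.merge(left, right, on='Timestamp', how='outer'), frames)
df_joined.to_csv('merged.csv', index=False)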