Looping through csv files to create concatenated dataframe. Need to use date in file name to create new column for each file

Question

I have some code that looks like this:

import pandas as pd
import glob

root_directory = r"\\some\shared\directory"

all_files = glob.glob(f'{root_directory}\CPE_*.csv')

li = []

for filename in all_files:
    frame = pd.read_csv(filename, index_col=None, header=0, encoding='latin1')
    li.append(frame)

df = pd.concat(li, axis=0, ignore_index=True)

This code allows me to concatenate the data and create a master csv file, but I want to add a new column to each dataframe as I loop through them. The file names look something like this: CPE_02082020.csv , CPE_02092020 , etc. So the date is in the file name itself with the format of mmddyyyy. How do I add a date column to each file as I loop through and concatenate them?

each new file will add just a single column? Do the files have equal number of rows? — Mirronelli, Apr 01 '21 at 14:20
No all files do not have equal number of rows. Just looking to add 1 "date" column to this master dataframe. — Mike Mathews Jr, Apr 01 '21 at 14:22

score 3 · Accepted Answer · answered Apr 01 '21 at 15:57

To add a date column to each dataframe you can simply add a new column to frame while looping through all_files. This method was suggested by biobirdman on a previous post. You can then use split() to get just the date from the filename.

Try:

for filename in all_files:
    frame = pd.read_csv(filename, index_col=None, header=0, encoding='latin1')
    frame['date'] = filename.split('_')[1]
    li.append(frame)

df = pd.concat(li, axis=0, ignore_index=True)

Then to convert df['date'] to datetime add:

df['date'] = pd.to_datetime(df['date'])

score 0 · Answer 2 · edited Apr 09 '21 at 23:26

You can assign the column during a for loop, let's simplify a little using a list comprehension and Pathlib:

import pandas as pd 
from pathlib import Path 

root_dir = Path('\\some\shared\directory')

all_files = root_dir.glob('*CPE_*.csv')

dfs = [
    pd.read_csv(f, index_col=None, header=0, encoding='latin1')\
            .assign(date_col=
                pd.to_datetime(f.stem.split('_')[1],format='%m%d%Y',errors='coerce'))
    for f in all_files
]


print(pd.concat(dfs))

Looping through csv files to create concatenated dataframe. Need to use date in file name to create new column for each file

2 Answers2