Importing multiple .cnv files, extract filename, and attach as a column

Question

I have multiple .cnv files that I can import and concatenate with the following code:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import glob

# Get CSV files list from a folder
path ='/Users/mariacristinaalvarez/Documents/High3/'
csv_files = glob.glob(path + "/*_high3.cnv")

# Read each CSV file into DataFrame
# This creates a list of dataframes
df_list = (pd.read_csv(file, encoding="ISO-8859-1",delim_whitespace=True,skiprows=316 , header=None) for file in csv_files1)

# Concatenate all DataFrames
stations_df = pd.concat(df_list)

This code works for me to get all files concatenated into one, but I would like to have a column with the filename that is contained between '' . For example: filename = 'HLY2202_008_high3_predown_av1dbar.cnv' I would like to only extract the numbers after HLY2202 AND before _high3 So the return should be "008" I want to do this for each file and add the name as a column so it becomes a identifier when I do explorative data analysis.

The fact it's a cnv vs csv isn't relevant, as you're using `.read_csv`. This is already in the [accepted answer](https://stackoverflow.com/a/21232849/7758804) in this [duplicate](https://stackoverflow.com/q/20906474/7758804) — Trenton McKinney, Apr 17 '23 at 20:02

Corralien · Answer 1 · 2023-04-13T19:11:11.100

You can create a dict of dataframes instead of a list. The key will be used by pd.concat to create a new index:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import pathlib

# Use high-level module pathlib rather than os/glob
path = pathlib.Path('/Users/mariacristinaalvarez/Documents/High3/')
csv_files = path.glob('*_high3_*.cnv')

# CSV parameters
params = {'encoding': 'ISO-8859-1', 'delim_whitespace': True, 
          'skiprows': 316, 'header': None}

stations_df = (pd.concat({file.stem.split('_')[1]: pd.read_csv(file, **params) 
                          for file in csv_files}, names=['id'], axis=0)
                 .reset_index(level='id').reset_index(drop=True))

Output:

>>> stations_df

    id        date  value
0  008  2020-01-01     35  # file HLY2202_008_high3_predown_av1dbar.cnv
1  008  2020-01-02     40
2  008  2020-01-03     45
3  009  2020-01-01     35  # HLY2202_009_high3_predown_av1dbar.cnv
4  009  2020-01-02     40
5  009  2020-01-03     45

score 1 · Accepted Answer · answered Apr 13 '23 at 18:16

Use pathlib and assign the filename/substring you need to a new column.

import pandas as pd
from pathlib import Path

path = Path("/Users/mariacristinaalvarez/Documents/High3/")
csv_files = path.glob("*_high3.cnv")

df_list = [pd.read_csv(file, encoding="ISO-8859-1", delim_whitespace=True, skiprows=316, header=None)
               .assign(filename=file.stem.split("_")[1]) for file in csv_files]

stations_df = pd.concat(df_list)

Importing multiple .cnv files, extract filename, and attach as a column

2 Answers2