2

I would like to read in multiple files from a directory, add an identifier for each file, and concatenate them into a single dataframe. Below is an example with two files but I am looking for a more efficient way to generalize the procedure to many files. I'd like to add an id column that identifies each file (so my question slightly differs from this)

Reprex:

# file1.csv
# file2.csv

import pandas as pd

df1 = pd.read_csv("file1.csv")
df["id"] = "file1"

df2 = pd.read_csv("file2.csv")
df2["id"] = "file2"


df_all = pd.concat[df, df2]

I would also like to use it for .txt files like so:

>>> %cat 'file1.txt'
A|B|C
1|0.5|good
2|0.2|bad
3|0.6|good
4|0.4|ok

>>> %cat 'file2.txt'
A|B|C
1|0.2|bad
2|0.5|ok
3|0.9|good
4|0.3|ok
5|0.7|bad
tall_table
  • 311
  • 3
  • 11

1 Answers1

1

You can use a dict to collect data from csv files and pathlib to walk into directory:

import pathlib

root_dir = pathlib.Path('.')
data = {}

for pth in root_dir.glob('*.csv'):
    data[pth.stem] = pd.read_csv(pth)

df = pd.concat(data).rename_axis(['id', None]).reset_index(level='id')

Example:

>>> %cat 'data1.csv'
A,B
1,0.5
2,0.2
3,0.6
4,0.4

>>> %cat 'data2.csv'
A,B
1,0.2
2,0.5
3,0.9
4,0.3
5,0.7

# final output
>>> df
      id  A    B
0  data2  1  0.2
1  data2  2  0.5
2  data2  3  0.9
3  data2  4  0.3
4  data2  5  0.7
0  data1  1  0.5
1  data1  2  0.2
2  data1  3  0.6
3  data1  4  0.4
Corralien
  • 109,409
  • 8
  • 28
  • 52