Pandas - import CSV files in folder, change column name if it contains a string, concat into one dataframe

Question

I have a folder with about 40 CSV files containing data by month. I want to combine this all together, however I have one column in these CSV files that are either denoted as 'implementationstatus' or 'implementation'. When I try to concat using Pandas, obviously this is a problem. I want to basically change 'implementationstatus' to 'implementation' for each CSV file as it is imported. I could run a loop for each CSV file, change the column name, export it, and then run my code again with everything changed, but that just seems prone to error or unexpected things happening.

Instead, I just want to import all the CSVs, change the column name 'implementationstatus' to 'implementation' IF APPLICABLE, and then concatenate into one data frame. My code is below.

import pandas as pd
import os
import glob

path = 'c:/mydata'

filepaths = [f for f in os.listdir(".") if f.endswith('.csv')]
df = pd.concat(map(pd.read_csv, filepaths),join='inner', ignore_index=True)
df.columns = df.columns.str.replace('implementationstatus', 'implementation') # I know this doesn't work, but I am trying to demonstrate what I want to do

prahasanam_boi · Accepted Answer · 2022-03-18T21:07:00.247

2

If you want to change the column name, please try this:

import glob
import pandas as pd

filenames = glob.glob('c:/mydata/*.csv')
all_data = []

for file in filenames:
    df = pd.read_csv(file)
    if 'implementationstatus' in df.columns:
      df = df.rename(columns={'implementationstatus':'implementation'})

    all_data.append(df)
df_all = pd.concat(all_data, axis=0)

edited Mar 18 '22 at 21:07

answered Mar 18 '22 at 19:15

prahasanam_boi

816
3
10

Wouldn't I need to change the column name as I am concatenating instead of after though? I posted the df.columns part AFTER just to show that I'm trying. Maybe I'm missing something. If you can embed within the full code that might be helpful. Thanks! – tenebris silentio Mar 18 '22 at 19:39
Can you give more info about the 40 datasets? Are they having the same number of columns other than the 'implementationstatus' one ? – prahasanam_boi Mar 18 '22 at 19:46
please let me know is this modified answer is what you want – prahasanam_boi Mar 18 '22 at 19:52
1

[Never call `DataFrame.append` or `pd.concat` inside a for-loop. It leads to quadratic copying.](https://stackoverflow.com/a/36489724/1422451). Instead append to a list and then concatenate *once* outside the loop. – Parfait Mar 18 '22 at 20:21
thank you for the info, I have modified the code – prahasanam_boi Mar 18 '22 at 21:07
Thank you. Sorry for the delay in response. These datasets will not necessarily have the same number of columns, but as long as the implementation column is in the final DF, that's fine. I don't need most of the other columns in the dataset. – tenebris silentio Mar 19 '22 at 20:29

score 0 · Answer 2 · answered Mar 18 '22 at 19:19

0

You can use a combination of header and names parameters from the pd.read_csv function to solve it.

You must pass to names a list containing the names for all columns on the csv files. This will allow you to standardize all names.

From pandas docs: names: array-like, optional List of column names to use. If the file contains a header row, then you should explicitly pass header=0 to override the column names. Duplicates in this list are not allowed.

answered Mar 18 '22 at 19:19

Miguel Benedetto

11
3

I'm still having some trouble - likely within the syntax. Would you be willing to show this within the code I posted? Thanks. – tenebris silentio Mar 18 '22 at 19:41

Pandas - import CSV files in folder, change column name if it contains a string, concat into one dataframe

2 Answers2