0

I have a folder with about 40 CSV files containing data by month. I want to combine this all together, however I have one column in these CSV files that are either denoted as 'implementationstatus' or 'implementation'. When I try to concat using Pandas, obviously this is a problem. I want to basically change 'implementationstatus' to 'implementation' for each CSV file as it is imported. I could run a loop for each CSV file, change the column name, export it, and then run my code again with everything changed, but that just seems prone to error or unexpected things happening.

Instead, I just want to import all the CSVs, change the column name 'implementationstatus' to 'implementation' IF APPLICABLE, and then concatenate into one data frame. My code is below.

import pandas as pd
import os
import glob

path = 'c:/mydata'

filepaths = [f for f in os.listdir(".") if f.endswith('.csv')]
df = pd.concat(map(pd.read_csv, filepaths),join='inner', ignore_index=True)
df.columns = df.columns.str.replace('implementationstatus', 'implementation') # I know this doesn't work, but I am trying to demonstrate what I want to do
petezurich
  • 9,280
  • 9
  • 43
  • 57

2 Answers2

2

If you want to change the column name, please try this:

import glob
import pandas as pd

filenames = glob.glob('c:/mydata/*.csv')
all_data = []

for file in filenames:
    df = pd.read_csv(file)
    if 'implementationstatus' in df.columns:
      df = df.rename(columns={'implementationstatus':'implementation'})

    all_data.append(df)
df_all = pd.concat(all_data, axis=0)
prahasanam_boi
  • 816
  • 3
  • 10
  • Wouldn't I need to change the column name as I am concatenating instead of after though? I posted the df.columns part AFTER just to show that I'm trying. Maybe I'm missing something. If you can embed within the full code that might be helpful. Thanks! – tenebris silentio Mar 18 '22 at 19:39
  • Can you give more info about the 40 datasets? Are they having the same number of columns other than the 'implementationstatus' one ? – prahasanam_boi Mar 18 '22 at 19:46
  • please let me know is this modified answer is what you want – prahasanam_boi Mar 18 '22 at 19:52
  • 1
    [Never call `DataFrame.append` or `pd.concat` inside a for-loop. It leads to quadratic copying.](https://stackoverflow.com/a/36489724/1422451). Instead append to a list and then concatenate *once* outside the loop. – Parfait Mar 18 '22 at 20:21
  • thank you for the info, I have modified the code – prahasanam_boi Mar 18 '22 at 21:07
  • Thank you. Sorry for the delay in response. These datasets will not necessarily have the same number of columns, but as long as the implementation column is in the final DF, that's fine. I don't need most of the other columns in the dataset. – tenebris silentio Mar 19 '22 at 20:29
0

You can use a combination of header and names parameters from the pd.read_csv function to solve it.

You must pass to names a list containing the names for all columns on the csv files. This will allow you to standardize all names.

From pandas docs: names: array-like, optional List of column names to use. If the file contains a header row, then you should explicitly pass header=0 to override the column names. Duplicates in this list are not allowed.