
Hello StackOverflow community!

Struggling new Python person here. I have code that worked until I added more to it, and I'm trying to figure out what I did wrong to break it. For each file, I'm trying to import it, read the file name, remove columns, reset the index, fill a column with the filename (I need that info later on), and then move on to the next file.

For some reason, it's only importing the LAST file in the folder. I know I've done something wrong.

Any help would be very much appreciated.

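import glob
import os

import pandas as pd

csvPath = "blahblah"

# columns= sets the column names; a bare list would be treated as row data
dfData = pd.DataFrame(columns=['NTLogin', 'Date', '', 'FileName'])

# os.path.join avoids the invalid "\*" escape sequence in the pattern
for f in glob.glob(os.path.join(csvPath, "*.csv")):
    df = pd.read_csv(f)
    filename = os.path.basename(f)
    df.drop(df.columns[[0,1,3]], axis=1, inplace=True)
    df['ID'] = df['ID'].str.upper()
    df = df.set_index('ID').stack().reset_index()
    df['Filename'] = filename
    dfData = pd.concat([df, dfData], ignore_index=True)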
J. Roybomb
  • I just answered this here: https://stackoverflow.com/questions/71654322/adding-data-to-an-existing-excel-table/71654898#71654898 – pyaj Mar 28 '22 at 23:53

1 Answer


It is processing all the CSVs, but when concatenating you are not using your base dataframe (dfData), only the new dataframe (df), so each iteration replaces the previous result.

As for the Filename column, it will be overwritten every time. Set it on df to avoid this:

df['Filename'] = filename
dfData = pd.concat([dfData, df], ignore_index=True)
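To make the difference concrete, here is a minimal sketch that runs without any files. The two small dataframes and the names `frames`, `overwritten`, and `accumulated` are made up for illustration, and the "buggy" loop is only one plausible shape of the original mistake, not necessarily the exact code from the question.

```python
import pandas as pd

# Hypothetical stand-ins for the dataframes read from each CSV.
frames = {
    "a.csv": pd.DataFrame({"ID": ["X1"], "value": [1]}),
    "b.csv": pd.DataFrame({"ID": ["X2"], "value": [2]}),
}

# Buggy pattern: the accumulator is never passed to concat,
# so each iteration replaces the previous result.
overwritten = pd.DataFrame()
for filename, df in frames.items():
    df = df.copy()
    df["Filename"] = filename
    overwritten = pd.concat([df], ignore_index=True)

# Fixed pattern: the accumulator is part of every concat call.
accumulated = pd.DataFrame()
for filename, df in frames.items():
    df = df.copy()
    df["Filename"] = filename
    accumulated = pd.concat([accumulated, df], ignore_index=True)

print(len(overwritten))   # rows from the last file only
print(len(accumulated))   # rows from every file
```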

List method

As suggested by pyaj in the comments, you can also collect the dataframes in a list and concatenate them once at the end to achieve the same thing.

It will look like this:

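import glob
import os

import pandas as pd

csvPath = "blahblah"

df_list = []

# os.path.join avoids the invalid "\*" escape sequence in the pattern
for f in glob.glob(os.path.join(csvPath, "*.csv")):
    df = pd.read_csv(f)
    filename = os.path.basename(f)
    df.drop(df.columns[[0,1,3]], axis=1, inplace=True)
    df['ID'] = df['ID'].str.upper()
    df = df.set_index('ID').stack().reset_index()
    df['Filename'] = filename

    # keep each per-file dataframe; concatenate once after the loop
    df_list.append(df)

dfData = pd.concat(df_list, ignore_index=True)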

You can also check the list to see if each individual dataframe is correct.
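One way to do that check is sketched below. The contents of `df_list` here are invented so the snippet runs on its own; in practice you would run the loop over the `df_list` built above. `files_seen` is a hypothetical helper name.

```python
import pandas as pd

# Hypothetical stand-in for the df_list built in the loop above.
df_list = [
    pd.DataFrame({"ID": ["X1", "X1"], "value": [1, 2], "Filename": "a.csv"}),
    pd.DataFrame({"ID": ["X2"], "value": [3], "Filename": "b.csv"}),
]

# Print the shape and source file of each dataframe before concatenating.
for i, frame in enumerate(df_list):
    print(i, frame.shape, frame["Filename"].unique().tolist())

# One entry per file that made it into the list.
files_seen = [frame["Filename"].iloc[0] for frame in df_list]

dfData = pd.concat(df_list, ignore_index=True)

# The combined row count should equal the sum of the individual counts.
print(len(dfData), sum(len(frame) for frame in df_list))
```

If a file is missing from `files_seen`, the problem is in reading it; if the row counts disagree, the problem is in the concatenation.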

farshad
  • When I do it like that I get a name error: name 'dfData' is not defined – J. Roybomb Mar 29 '22 at 00:00
  • I assumed you already have `dfData` defined like `dfData = pd.DataFrame(columns=['Column1', 'Column2', 'FileName'])` before entering the loop. – farshad Mar 29 '22 at 00:03
  • I got that part fixed, but it's still only pulling in one file: the last file. I updated the original question with my new code. – J. Roybomb Mar 29 '22 at 00:05
  • To check whether each file is being read, add a `print(f)` to the loop and check the result. – farshad Mar 29 '22 at 00:11
  • Good sleuthing! Yup, it's pulling in all the files, they just aren't being added to the dataframe. Okay... got it working with your help! Thank you @farshad! – J. Roybomb Mar 29 '22 at 00:14
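The `print(f)` check from the comments can be sketched roughly as follows. This is only an illustration: the temporary folder and the two tiny CSVs are assumptions that stand in for the real `csvPath`, and the column handling from the question is left out so that only the diagnostic is shown.

```python
import glob
import os
import tempfile

import pandas as pd

# Assumption: a throwaway folder with two small CSVs stands in for csvPath.
csvPath = tempfile.mkdtemp()
for name in ("a.csv", "b.csv"):
    sample = pd.DataFrame({"ID": ["x"], "value": [1]})
    sample.to_csv(os.path.join(csvPath, name), index=False)

dfData = pd.DataFrame()
for f in sorted(glob.glob(os.path.join(csvPath, "*.csv"))):
    print(f)  # confirms that the loop visits this file
    df = pd.read_csv(f)
    df["Filename"] = os.path.basename(f)
    dfData = pd.concat([dfData, df], ignore_index=True)
    print(len(dfData))  # should grow on every iteration
```

If every path is printed but the row count does not grow, the files are being read and the concatenation step is where the rows are lost.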