0

I have about 50 excel sheet with .'xlsb' extension. I'd like to concatanate a specific worksheet into pandas DataFrame (all worksheets names are the same). The problem I have is that the names of columns are not exactly the same in each worksheet. I wrote a code using pandas but the way it works is that it concats all values into the same column in pandas data frame but based on the name of column. So for example: sometimes I have column called: FgsNr and sometimes FgNr - the datatype and the meaning in both columns are exactly the same and I would like to have them in the same column in Data Frame but pandas creates to separate columns in data frame and stack together only those values that are listed in column with the same name.

files = glob(r'C:\Users\Folder\*xlsb')

for file in files:
    Datafile = pd.concat(pd.read_excel(file, engine='pyxlsb', sheet_name='Sheet1', usecols='A:F', header=0) for file in files)

How could I correct the code so it copied and concatanted all values based on columns from excel at the same time ignoring the names of columns ?

Calle
  • 151
  • 1
  • 7
  • you'll have to use something like diff lib or do it manually and fix the column names on the import - is this a one time operation or will it be recurring? – Umar.H Aug 10 '21 at 13:33
  • it's a monthly report and different people prepare it so I believe that the column names might change in the future again – Calle Aug 10 '21 at 13:43
  • Not sure if youll be able to do that whle reading the files, that can be done post reading the files, if you can share sample data then we can help you.. – Ibrahim Ayoup Aug 10 '21 at 13:43
  • can you share an example of how the column names may change? – Umar.H Aug 10 '21 at 13:44

1 Answers1

1

When concatenating multiple dataframes with the same format, you can use the below snippet for speed and efficiency.

The basic logic is that you put them into a list, and then concatenate at the final stage.

files = glob(r'C:\Users\Folder\*xlsb')
dfs = []
for file in files:
    df = pd.read_excel(file, engine='pyxlsb', sheet_name='Sheet1', usecols='A:F', header=0)
    dfs.append(df)
large_df = pd.concat(dfs, ignore_index=True)

Also refer to the below : Creating an empty Pandas DataFrame, then filling it?

Hanalia
  • 187
  • 1
  • 9