I am trying to concatenate a time column and one other single column from multiple xlsx files, dropping the rest. My problem is, is that some of the xlsx files contain more columns than others, and when running it in a loop i get a ' index 4 is out of bounds for axis 0 with size 4' IndexError message.
My code is quite convoluted so but ive included all of it incase it is needed.
Importing modules
import pandas as pd
import numpy as np
Defining variables
DIR = r'C:\Directory'+'\\'
FILE = "Filename.xlsx"
Below is how i read in the directory xlsx file
def get_df_from_xlsx(DIR,FILE):
DF= pd.read_excel(DIR+FILE, sheet_name='Sheet1')
return DF
DF=get_df_from_xlsx(DIR, FILE)
This excel file has a column named 'Filename' which includes all the filenames of the xlsx files i need (over 300), which are in the same directory.
I create a list of the filenames with the code below and also the length of the list
ALL_COL= get_df_from_xlsx(DIR, FILE)['Filename'].values[0:]
numfiles= len(ROW1)
Each of these are just one file name which i use to create a DF which i later add values to.
COL= get_df_from_xlsx(DIR, FILE)['Filename'].values[1]
COL2= get_df_from_xlsx(DIR, FILE)['Filename'].values[4]
Beneath i open a xlsx file (COL) from columns and rename the column i want to join to the other files. I do this twice (def open_file(DIR, COL) and def open_file2(DIR, COL2)
def open_file(DIR, COL):
df1 = pd.read_excel(DIR+COL)
df1 = df1.rename(columns={'col_name':COL})
df1 = df1.drop(df1.columns[[1,3]], axis=1)
return df1
df1 =open_file(DIR, COL)
def open_file2(DIR, COL2):
df2 = pd.read_excel(DIR+COL2)
df2 = df2.rename(columns={'col_name':COL2})
df2 = df2.drop(df2.columns[[1,3]], axis=1)
return df2
df2 =open_file2(DIR, COL2)
The code beneath concat's both dataframes together, chronologically keeping all the data, infilling NaN when one column didnt have that time in the original file.
DF11 =pd.concat([s.set_index('Time') for s in [df1, df2]], axis=1).rename_axis('Time').reset_index()
Then all columns apart from the time column are dropped.
DF11 = DF11.drop(DF11.columns[[1,2]], axis=1)
DF11
The final step is this, when i run it like this it works as all original xlsx files have columns 1 and 3, however some have columns 4 and 5. if i try to drop them i get an error.
for i in range(numfiles):
df=pd.read_excel(DIR+ALL_COL[i])
df=df.rename(columns={'col_name':ALL_COL[i]})
df=df.drop(df.columns[[1,3]], axis=1)
DF11=pd.concat([i.set_index('Time') for i in [df, DF11]], axis=1).rename_axis('Time').reset_index()
DF11
I have tried multiple variations of this line including.
df=df.drop(df.columns[[1,3,4,5]], axis=1) #this does not work
using the two lines of code like follows
df=df.drop(df.columns[[1,3]], axis=1)
df=df.drop(columns=['four', 'five']) #with and without 'axis=1' and inplace=True
and like this:
df.drop(columns=['one', 'three', 'four', 'five'])#with and without 'axis=1' and inplace=True
all give me the same error: index 4 is out of bounds for axis 0 with size 4'.
I will try and show what what the xlxs look like and how they differ.
example 1 (less columns)
Time, Dont need1, need 1, dont need 2
xxxx xxx xx xxxx
example 2 (2 more columns)
Time, Dont need1, need 1, dont need 2, dont need 3, dont need 4
xxxx xxx xx xxxx xxxxx xxxx
please note that all column names are the same across the xlsx files (apart from if they are not there).
is there a way to delete columns 4 and 5 if they are there and ignore them if they are not? Thanks
Im on python 3.7, using jupyter notebook.