0

I am trying to concatenate a time column and one other single column from multiple xlsx files, dropping the rest. My problem is, is that some of the xlsx files contain more columns than others, and when running it in a loop i get a ' index 4 is out of bounds for axis 0 with size 4' IndexError message.

My code is quite convoluted so but ive included all of it incase it is needed.

Importing modules

import pandas as pd
import numpy as np

Defining variables

DIR = r'C:\Directory'+'\\'

FILE = "Filename.xlsx"

Below is how i read in the directory xlsx file

def get_df_from_xlsx(DIR,FILE):
    DF= pd.read_excel(DIR+FILE, sheet_name='Sheet1')
    return DF

DF=get_df_from_xlsx(DIR, FILE)

This excel file has a column named 'Filename' which includes all the filenames of the xlsx files i need (over 300), which are in the same directory.

I create a list of the filenames with the code below and also the length of the list

ALL_COL= get_df_from_xlsx(DIR, FILE)['Filename'].values[0:]
numfiles= len(ROW1)

Each of these are just one file name which i use to create a DF which i later add values to.

COL= get_df_from_xlsx(DIR, FILE)['Filename'].values[1]
COL2= get_df_from_xlsx(DIR, FILE)['Filename'].values[4]

Beneath i open a xlsx file (COL) from columns and rename the column i want to join to the other files. I do this twice (def open_file(DIR, COL) and def open_file2(DIR, COL2)

def open_file(DIR, COL):
    df1 = pd.read_excel(DIR+COL)
    df1 = df1.rename(columns={'col_name':COL})
    df1 = df1.drop(df1.columns[[1,3]], axis=1)
    return df1
df1 =open_file(DIR, COL)

def open_file2(DIR, COL2):
    df2 = pd.read_excel(DIR+COL2)
    df2 = df2.rename(columns={'col_name':COL2})
    df2 = df2.drop(df2.columns[[1,3]], axis=1)
    return df2
df2 =open_file2(DIR, COL2)

The code beneath concat's both dataframes together, chronologically keeping all the data, infilling NaN when one column didnt have that time in the original file.

DF11 =pd.concat([s.set_index('Time') for s in [df1, df2]], axis=1).rename_axis('Time').reset_index()

Then all columns apart from the time column are dropped.

DF11 = DF11.drop(DF11.columns[[1,2]], axis=1)

DF11

The final step is this, when i run it like this it works as all original xlsx files have columns 1 and 3, however some have columns 4 and 5. if i try to drop them i get an error.

for i in range(numfiles): 
    df=pd.read_excel(DIR+ALL_COL[i])
    df=df.rename(columns={'col_name':ALL_COL[i]})
    df=df.drop(df.columns[[1,3]], axis=1) 
    DF11=pd.concat([i.set_index('Time') for i in [df, DF11]], axis=1).rename_axis('Time').reset_index()
DF11

I have tried multiple variations of this line including.

df=df.drop(df.columns[[1,3,4,5]], axis=1) #this does not work

using the two lines of code like follows

df=df.drop(df.columns[[1,3]], axis=1)
df=df.drop(columns=['four', 'five']) #with and without 'axis=1' and inplace=True

and like this:

df.drop(columns=['one', 'three', 'four', 'five'])#with and without 'axis=1' and inplace=True

all give me the same error: index 4 is out of bounds for axis 0 with size 4'.

I will try and show what what the xlxs look like and how they differ.

example 1 (less columns)

Time,    Dont need1,    need 1,     dont need 2
xxxx     xxx            xx           xxxx

example 2 (2 more columns)

Time,    Dont need1,    need 1,     dont need 2,    dont need 3,    dont need 4
xxxx     xxx            xx           xxxx           xxxxx           xxxx

please note that all column names are the same across the xlsx files (apart from if they are not there).

is there a way to delete columns 4 and 5 if they are there and ignore them if they are not? Thanks

Im on python 3.7, using jupyter notebook.

Josh Alt
  • 29
  • 9
  • If you know which columns you need can't you just [specify reading those](https://pandas.pydata.org/docs/user_guide/io.html#parsing-specific-columns) and none of the others? Then you won't have to drop anything. – wwii Apr 30 '20 at 15:30
  • Does this answer your question? [how to read certain columns from Excel using Pandas - Python](https://stackoverflow.com/questions/33655127/how-to-read-certain-columns-from-excel-using-pandas-python) ... there are others searching with `python pandas read specific columns of excel file site:stackoverflow.com` or other variations. – wwii Apr 30 '20 at 15:33
  • @wwii yes thanks mate! – Josh Alt Apr 30 '20 at 17:12

0 Answers0