0

I am new to Python and trying to automate some tasks. I have an Excel file with 8 sheets where each sheet has some identifier on top followed below that are tabular data with headers. Each sheet has the identifiers of interest and the tables in the same location. What I want to do is to extract some data from the top of each sheet and insert them as columns, remove unwanted rows(after I have assigned some of them to columns) and columns and then merge into one CSV file as output.

The code I have written does the job. My code reads in each sheet, performs the operations on the sheet, then I start the same process for the next sheet (8 times) before using .concat to merge them.

import pandas as pd
import numpy as np
inputfile = "input.xlsx" 
outputfile = "merged.csv"

##LN X: READ FIRST SHEET AND ASSIGN HEADER INFORMATION TO COLUMNS

df1 = pd.read_excel(inputfile, sheet_name=0, usecols="A:N", index=0)

#Define cell locations of fields in the header area to be assigned to 
columns
#THIS CELL LOCATIONS ARE SAME ON ALL SHEETS
A = df1.iloc[3,4]
B = df1.iloc[2,9]
C = df1.iloc[3,9]
D = df1.iloc[5,9]
E = df1.iloc[4,9]

#Insert well header info as columns in data for worksheet1
df1.insert(0,"column_name", A)

df1.insert(1,"column_name", B)

df1.insert(4,"column_name", E)

# Rename the columns in `enter code here`worksheet1 DataFrame to reflect 
actual column headers
df1.rename(columns={'Unnamed: 0': 'Header1', 
               'Unnamed: 1': 'Header2', }, inplace=True)         

df_merged = pd.concat([df1, df2, df3, df4, df5, df6, df7, 
  df8],ignore_index=True, sort=False)

#LN Y: Remove non-numerical entries

df_merged = df_merged.replace(np.nan, 0)

##Write results to CSV file

df_merged.to_csv(outputfile, index=False)

Since this code will be used on other Excel files with varying numbers of sheets, I am looking for any pointers on how to include the repeating operations in each sheet in a loop. Basically repeating the steps between LN X to LN Y for each sheet (8 times!!). I am struggling with how to use a loop function Thanks in advance for your assistance.

okrika
  • 3
  • 2
  • Hi, great job so far! I would start with the answer in this SO post: https://stackoverflow.com/questions/17977540/pandas-looking-up-the-list-of-sheets-in-an-excel-file?lq=1 to get the sheets in a dictionary, then loop over the dict to process each sheet. Make sense? – mgrollins Oct 08 '19 at 22:48
  • @mgrollins Thanks for your comment. I read that post before posting. I know how to read in the sheets using ExcelFile or read-excel (single sheet, all at once). The challenge I am facing is how to run the operations for each sheet in a loop ie, write the functions once and have it apply to each sheet as they are read into pandas. – okrika Oct 09 '19 at 00:55

1 Answers1

1
df1 = pd.read_excel(inputfile, sheet_name=0, usecols="A:N", index=0)

You should change the argument sheet_name to

sheet_name=None

Then df1 will be a dictionary of DataFrames. Then you can loop over df1 using

for df in df1:

  df1[df].insert(0,"column_name", A)
....

Now perform your operations and merge the dfs. You can loop over them again and concatenate them to one final df.

robbwh
  • 337
  • 2
  • 9
  • Thanks for your answer. I tried that and got AttributeError: 'collections.OrderedDict' object has no attribute 'iloc'. This is for the part where I was mapping the content of the new columns to a cell address. For example well = df1[3.4] – okrika Oct 09 '19 at 03:09
  • Thanks. This worked and made my code much shorter. The best part is that it can now be used on similar files with any number of sheets!! – okrika Oct 09 '19 at 04:16