2

I have an excel file with multiple sheets which need to be consolidated. However the column headers are varying from one another. Currently the data looks like this.

Sheet 1
+-------------+--------------+----------+--------+---------+---------+
| FISCAL_YEAR | COMPANY_CODE | ACCOUNTS | Header | Header1 | Header2 |
+-------------+--------------+----------+--------+---------+---------+
|          17 | Data         | Data     |      0 |       0 |       0 |
|          17 | Data         | Data     |      0 |       0 |       0 |
+-------------+--------------+----------+--------+---------+---------+

Sheet 2
+-------------+--------------+----------+---------+---------+
| FISCAL_YEAR | COMPANY_CODE | ACCOUNTS | Header3 | Header2 |
+-------------+--------------+----------+---------+---------+
|          15 | Data         | Data     |       0 |       0 |
|          15 | Data         | Data     |       0 |       0 |
+-------------+--------------+----------+---------+---------+

Sheet 3
+-------------+--------------+----------+---------+---------+---------+
| FISCAL_YEAR | COMPANY_CODE | ACCOUNTS | Header4 | Header1 | Header3 |
+-------------+--------------+----------+---------+---------+---------+
|          16 | Data         | Data     |       0 |       0 |       0 |
|          16 | Data         | Data     |       0 |       0 |       0 |
+-------------+--------------+----------+---------+---------+---------+

OUTPUT
+-------------+--------------+----------+--------+---------+---------+---------+---------+-----------+
| FISCAL_YEAR | COMPANY_CODE | ACCOUNTS | Header | Header1 | Header2 | Header3 | Header4 | SheetName |
+-------------+--------------+----------+--------+---------+---------+---------+---------+-----------+
|          17 | Data         | Data     | 0      | 0       | 0       | null    | null    | Sheet1    |
|          17 | Data         | Data     | 0      | 0       | 0       | null    | null    | Sheet1    |
|          15 | Data         | Data     | null   | null    | 0       | 0       | null    | Sheet2    |
|          15 | Data         | Data     | null   | null    | 0       | 0       | null    | Sheet2    |
|          16 | Data         | Data     | null   | 0       | null    | 0       | 0       | Sheet3    |
|          16 | Data         | Data     | null   | 0       | null    | 0       | 0       | Sheet3    |
+-------------+--------------+----------+--------+---------+---------+---------+---------+-----------+

I am relatively new to Python. I have used Pandas and numpy. I have as many as 60 sheets to work. Can anyone help me to understand how can I achieve this? If not python is there anyway other tool/method I should use? I could really use a code sample to start with.

Your help is really appreciated. Thank you in advance

sacuL
  • 49,704
  • 8
  • 81
  • 106

3 Answers3

1

Using R, this is pretty easy to do.

library(openxlsx) # to read xlsx files
library(purrr)    # for the "map" function

wb <- loadWorkbook("path/filename.xlsx")
all_sheets <- names(wb)

merged_data <- map_df(all_sheets, ~ read.xlsx(wb, sheet = .x)
Melissa Key
  • 4,476
  • 12
  • 21
0

Using for loop and rbind in R :

for (i in file.list) {
    data <- rbind(data, read.xlsx(i, sheetIndex = 1))
}

rbind usage : To join two data frames (datasets) vertically, use the rbind function. The two data frames must have the same variables, but they do not have to be in the same order.

total <- rbind(data frameA, data frameB) 
Ali
  • 350
  • 3
  • 10
  • I think the whole point is that they do not all have the same variables. If you aren't familiar, check out `bind_rows()` in `dplyr` - it will fill missing variables if only one data set has them. Because `dplyr` and `purrr` are both designed to work together, the solution above will work even without all the variables being present in all files. – Melissa Key Apr 15 '18 at 22:33
0
import pandas as pd

filepath = r"filePath here"
sheets_dict = pd.read_excel(filepath, sheet_name=None)

full_table = pd.DataFrame()

#loop through sheets
for name, sheet in sheets_dict.items():
    sheet['sheet'] = name
    #sheet = sheet.rename(columns=lambda x: x.split('\n')[-1])
    full_table = full_table.append (sheet)

full_table.reset_index (inplace=True, drop=True)

#Write to Excel
writer = pd.ExcelWriter('consolidated_TB1.xlsx', engine='xlsxwriter')
full_table.to_excel(writer,'Sheet1')

# Close the Pandas Excel writer and output the Excel file.
writer.save()
Itz
  • 3
  • 2