There are several Excel files in a folder. Their structures are the same and contents are different. I want to combine them into 1 Excel file, read in this sequence 55.xlsx, 44.xlsx, 33.xlsx, 22.xlsx, 11.xlsx.
These lines are doing a good job:
import os
import pandas as pd
working_folder = "C:\\temp\\"
files = os.listdir(working_folder)
files_xls = []
for f in files:
if f.endswith(".xlsx"):
fff = working_folder + f
files_xls.append(fff)
df = pd.DataFrame()
for f in reversed(files_xls):
data = pd.read_excel(f) #, sheet_name = "")
df = df.append(data)
df.to_excel(working_folder + 'Combined 1.xlsx', index=False)
The picture shows how the original sheets looked like, also the result.
But in the sequential reading, I want only the unique rows to be appended, in addition to what’s in the data frame.
In this case:
the code read the file 55.xlsx first, then 44.xlsx, then 33.xlsx…
when it reads 44.xlsx, the row 444 Kate should not be appended as there were already a Kate from previous data frame.
when it reads 33.xlsx, the row 333 Kate should not be appended as there were already a Kate from previous data frame.
when it reads 22.xlsx, the row 222 Jack should not be appended as there were already a Jack from previous data frame.
By the way, here are the data frames (instead of Excel files) for your convenience.
d5 = {'Code': [555, 555], 'Name': ["Jack", "Kate"]}
d4 = {'Code': [444, 444], 'Name': ["David", "Kate"]}
d3 = {'Code': [333, 333], 'Name': ["Paul", "Kate"]}
d2 = {'Code': [222, 222], 'Name': ["Jordan", "Jack"]}
d1 = {'Code': [111, 111], 'Name': ["Leslie", "River"]}