1

There are several Excel files in a folder. Their structures are the same and contents are different. I want to combine them into 1 Excel file, read in this sequence 55.xlsx, 44.xlsx, 33.xlsx, 22.xlsx, 11.xlsx.

These lines are doing a good job:

import os
import pandas as pd


working_folder = "C:\\temp\\"
files = os.listdir(working_folder)
files_xls = []

for f in files:
    if f.endswith(".xlsx"):
        fff = working_folder + f
        files_xls.append(fff)

df = pd.DataFrame()

for f in reversed(files_xls):
    data = pd.read_excel(f) #, sheet_name = "")
    df = df.append(data)

df.to_excel(working_folder + 'Combined 1.xlsx', index=False)

The picture shows how the original sheets looked like, also the result.

enter image description here

But in the sequential reading, I want only the unique rows to be appended, in addition to what’s in the data frame.

In this case:

  1. the code read the file 55.xlsx first, then 44.xlsx, then 33.xlsx…

  2. when it reads 44.xlsx, the row 444 Kate should not be appended as there were already a Kate from previous data frame.

  3. when it reads 33.xlsx, the row 333 Kate should not be appended as there were already a Kate from previous data frame.

  4. when it reads 22.xlsx, the row 222 Jack should not be appended as there were already a Jack from previous data frame.

By the way, here are the data frames (instead of Excel files) for your convenience.

d5 = {'Code': [555, 555], 'Name': ["Jack", "Kate"]}
d4 = {'Code': [444, 444], 'Name': ["David", "Kate"]}
d3 = {'Code': [333, 333], 'Name': ["Paul", "Kate"]}
d2 = {'Code': [222, 222], 'Name': ["Jordan", "Jack"]}
d1 = {'Code': [111, 111], 'Name': ["Leslie", "River"]}
halfer
  • 19,824
  • 17
  • 99
  • 186
Mark K
  • 8,767
  • 14
  • 58
  • 118

2 Answers2

2
df.drop_duplicates(subset=['name'], keep='first')
Jay Shankar Gupta
  • 5,918
  • 1
  • 10
  • 27
1

I think need drop_duplicates:

import glob 

working_folder = "C:\\temp\\"
files = glob.glob(working_folder + '/*.xlsx')
dfs = [pd.read_excel(fp) for fp in files]
df = pd.concat(dfs)

df = df.drop_duplicates('Name')
df.to_excel(working_folder + 'Combined 1.xlsx', index=False)

Solution with data and inverse sorting files:

import glob 

working_folder = "C:\\temp\\"
files = glob.glob(working_folder + '/*.xlsx')
print (files)
['C:\\temp\\11.xlsx', 'C:\\temp\\22.xlsx', 'C:\\temp\\33.xlsx', 
 'C:\\temp\\44.xlsx', 'C:\\temp\\55.xlsx']

files = sorted(files, key=lambda x: int(x.split('\\')[-1][:-5]), reverse=True)
print (files)
['C:\\temp\\55.xlsx', 'C:\\temp\\44.xlsx', 'C:\\temp\\33.xlsx',
 'C:\\temp\\22.xlsx', 'C:\\temp\\11.xlsx']

dfs = [pd.read_excel(fp) for fp in files]
df = pd.concat(dfs)
print (df)
   Code    Name
0   555    Jack
1   555    Kate
0   444   David
1   444    Kate
0   333    Paul
1   333    Kate
0   222  Jordan
1   222    Jack
0   111  Leslie
1   111   River

df = df.drop_duplicates('Name')
print (df)
   Code    Name
0   555    Jack
1   555    Kate
0   444   David
0   333    Paul
0   222  Jordan
0   111  Leslie
1   111   River

df.to_excel(working_folder + 'Combined 1.xlsx', index=False)
jezrael
  • 822,522
  • 95
  • 1,334
  • 1,252