Using Python to combine selective rows from data frames into 1

Question

There are several Excel files in a folder. Their structures are the same and contents are different. I want to combine them into 1 Excel file, read in this sequence 55.xlsx, 44.xlsx, 33.xlsx, 22.xlsx, 11.xlsx.

These lines are doing a good job:

import os
import pandas as pd


working_folder = "C:\\temp\\"
files = os.listdir(working_folder)
files_xls = []

for f in files:
    if f.endswith(".xlsx"):
        fff = working_folder + f
        files_xls.append(fff)

df = pd.DataFrame()

for f in reversed(files_xls):
    data = pd.read_excel(f) #, sheet_name = "")
    df = df.append(data)

df.to_excel(working_folder + 'Combined 1.xlsx', index=False)

The picture shows how the original sheets looked like, also the result.

But in the sequential reading, I want only the unique rows to be appended, in addition to what’s in the data frame.

In this case:

the code read the file 55.xlsx first, then 44.xlsx, then 33.xlsx…
when it reads 44.xlsx, the row 444 Kate should not be appended as there were already a Kate from previous data frame.
when it reads 33.xlsx, the row 333 Kate should not be appended as there were already a Kate from previous data frame.
when it reads 22.xlsx, the row 222 Jack should not be appended as there were already a Jack from previous data frame.

By the way, here are the data frames (instead of Excel files) for your convenience.

d5 = {'Code': [555, 555], 'Name': ["Jack", "Kate"]}
d4 = {'Code': [444, 444], 'Name': ["David", "Kate"]}
d3 = {'Code': [333, 333], 'Name': ["Paul", "Kate"]}
d2 = {'Code': [222, 222], 'Name': ["Jordan", "Jack"]}
d1 = {'Code': [111, 111], 'Name': ["Leslie", "River"]}

score 2 · Answer 1 · answered Mar 15 '18 at 06:24

2

df.drop_duplicates(subset=['name'], keep='first')

answered Mar 15 '18 at 06:24

Jay Shankar Gupta

5,918
1
10
27

jezrael · Accepted Answer · 2018-03-15T06:39:19.503

1

I think need drop_duplicates:

import glob 

working_folder = "C:\\temp\\"
files = glob.glob(working_folder + '/*.xlsx')
dfs = [pd.read_excel(fp) for fp in files]
df = pd.concat(dfs)

df = df.drop_duplicates('Name')
df.to_excel(working_folder + 'Combined 1.xlsx', index=False)

Solution with data and inverse sorting files:

import glob 

working_folder = "C:\\temp\\"
files = glob.glob(working_folder + '/*.xlsx')
print (files)
['C:\\temp\\11.xlsx', 'C:\\temp\\22.xlsx', 'C:\\temp\\33.xlsx', 
 'C:\\temp\\44.xlsx', 'C:\\temp\\55.xlsx']

files = sorted(files, key=lambda x: int(x.split('\\')[-1][:-5]), reverse=True)
print (files)
['C:\\temp\\55.xlsx', 'C:\\temp\\44.xlsx', 'C:\\temp\\33.xlsx',
 'C:\\temp\\22.xlsx', 'C:\\temp\\11.xlsx']

dfs = [pd.read_excel(fp) for fp in files]
df = pd.concat(dfs)
print (df)
   Code    Name
0   555    Jack
1   555    Kate
0   444   David
1   444    Kate
0   333    Paul
1   333    Kate
0   222  Jordan
1   222    Jack
0   111  Leslie
1   111   River

df = df.drop_duplicates('Name')
print (df)
   Code    Name
0   555    Jack
1   555    Kate
0   444   David
0   333    Paul
0   222  Jordan
0   111  Leslie
1   111   River

df.to_excel(working_folder + 'Combined 1.xlsx', index=False)

edited Mar 15 '18 at 06:39

answered Mar 15 '18 at 06:22

jezrael

822,522
95
1,334
1,252

thank you! super efficient and it works! a question: how does this method ensure the reading sequence is 55, 44, 33, 22, 11? (just afraid it miss-kills) – Mark K Mar 15 '18 at 06:26
1

hmm, I think need sort it, give me some time, I have to test it. – jezrael Mar 15 '18 at 06:27
I think sort list of files `files` – jezrael Mar 15 '18 at 06:28
thanks! above works fine on the sample, but if the data frame is large then no way to manually detect if there are mis-kills – Mark K Mar 15 '18 at 06:28
1

@MarkK - drop duplicates by default keep first row if dupe. So it should work in real data too. – jezrael Mar 15 '18 at 06:29
superb! nice and easy! – Mark K Mar 15 '18 at 06:32
1

@MarkK - I add solution with inverse sorting by files names. – jezrael Mar 15 '18 at 06:42

Using Python to combine selective rows from data frames into 1

2 Answers2