-3

I have city addresses I need to put together and find all duplicates. I got to a point where I can find all the duplicates in excel files, easy so far. But I have to change each city in the code to search each file. How do I search each file without having to change the city in the code and then save it of course. I want to merge them which I can but I can't figure out why they create their own 3 columns and don't just merge with the 'A','B' & 'C' columns already there. Maybe Pandas isn't the best library to do this with and a better one can be suggested.

import os

file_df = pd.read_excel("Kermit.xlsx")

file_df.duplicated(subset = 'Address', keep = False)

file_df.drop_duplicates(subset= 'Address',inplace= True)

City = file_df.to_excel("Kermit2.xlsx", index= False)

# path = os.getcwd()
# files = os.listdir(path)
# print(files)

# files_xlsx = [f for f in files if f[-4:] == 'xlsx']

# print(files_xlsx)

# df = pd.DataFrame()

# for f in files_xlsx:
#    data = pd.read_excel(f, 'Sheet1')
#    df = df.append(data)`import os
aarongc43
  • 1
  • 2
  • 1
    Do you realise that the variable `City` can only have a value of `None` since `to_excel()` doesn't return anything? Why should you care about overwriting it? Just run it in a for-loop. – NotAName Oct 25 '20 at 23:18
  • 1
    Can you explain in more detail what you mean by: "they create their own 3 columns and don't just merge with the 'A','B' & 'C' columns already there." – NotAName Oct 25 '20 at 23:21
  • The files have three columns, Name, Address and Phone. When I merge two or more to a new file the 'A', 'B' & 'C' columns from the second city will go and create a 'D', 'E' & 'F' columns, so on and so on for all 12 or so cities I have. Instead what I want is for everything to go under the same three columns since its all the same data type of course. – aarongc43 Oct 25 '20 at 23:31
  • _Instead what I want is for everything to go under the same three columns since its all the same data type of course._ Take a look at https://pandas.pydata.org/pandas-docs/stable/user_guide/merging.html. – AMC Oct 25 '20 at 23:44
  • Thanks! Those both answered my questions. – aarongc43 Oct 26 '20 at 00:47

1 Answers1

0

It's hard to answer your question properly without knowing what your data is like and what your file naming is. I'll assume that all your excel files are in the same folder and they have same 3 columns of data.

In that case all you need to do is:

import os
import pandas as pd

source_folder = 'your_folder_location'
files = [os.path.join(source_folder, x) for x in os.listdir(source_folder) if x.endswith('xlsx')]

data = pd.append(pd.read_excel(x) for x in files)
data.drop_duplicates(subset=['Address'], inplace=True)
data.to_excel('blablabla.xlsx', index=False) 

NotAName
  • 3,821
  • 2
  • 29
  • 44