
I have a Python program which reads Excel documents. I need to allow only the first appearance of certain column combinations, e.g.:

    A     |  B
  -------------
  1.  200 | 201   
  2.  200 | 202
  3.  200 | 201
  4.  200 | 203
  5.  201 | 201
  6.  201 | 202
  .............

I want to remove/skip the third row, where the duplicate appears, and write the result to a CSV file. Here is the function I've been trying so far, but it's not working:

def validateExcel(filename):
    xls = xlrd.open_workbook(filename)
    setcount = 0
    column = 0
    count = 0
    # sheetcount = 0
    for sheet in xls.sheets():
        header = ""
        # sheetcount = sheetcount + 1
        number_of_rows = sheet.nrows
        number_of_columns = sheet.ncols
        sheetname = sheet.name
        mylist = []
        for row in range(1, number_of_rows):
            mylist = []
            for col in range(0, 2):
                mylist.append(sheet.cell_value(row, col))

            print mylist

            myset = set(mylist)

            print myset
5 Answers


`mylist = []` was used twice, and appending single cell values makes comparing whole rows difficult. It should be like this:

mylist = []
for row in range(1, number_of_rows):  
    mylist.append((sheet.cell_value(row, 0), sheet.cell_value(row, 1)))

myset = set(mylist)

Beware that `set` is not ordered. If you want the result in the original row order, you need an order-preserving approach.
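For illustration, a minimal order-preserving de-duplication sketch, using made-up `(A, B)` tuples standing in for the pairs collected from the sheet:

```python
# De-duplicating (A, B) pairs while keeping first-seen order.
# `rows` stands in for the tuples collected from the sheet above.
rows = [(200, 201), (200, 202), (200, 201), (200, 203)]

seen = set()
deduped = []
for pair in rows:
    if pair not in seen:   # tuples are hashable, so this lookup is cheap
        seen.add(pair)
        deduped.append(pair)

print(deduped)  # [(200, 201), (200, 202), (200, 203)]
```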


This should append the row (called `sublist` here) to your `mylist` list if it has not already been added, giving you a de-duplicated list of rows in the order they appear in the xlsx file. It might be worth looking at the pandas library if you can; if not, this should help:

def validateExcel(filename):

    xls = xlrd.open_workbook(filename)

    for sheet in xls.sheets():
        header = ""

        number_of_rows = sheet.nrows
        number_of_columns = sheet.ncols
        sheetname = sheet.name

        mylist = []

        for row in range(1, number_of_rows):
            sublist = [sheet.cell_value(row, col) for col in range(0, number_of_columns)]

            if sublist not in mylist:
                mylist.append(sublist)

            print mylist

    return mylist

EDIT:

If you have an xlsx file with more than one sheet, you could use a `dict` to store the de-duplicated row data with the sheet names as keys, then pass that dict into a CSV-writing function:

def validateExcel(filename):

    outputDict = {}

    xls = xlrd.open_workbook(filename)

    sheetCount = 0

    for sheet in xls.sheets():

        number_of_rows = sheet.nrows
        number_of_columns = sheet.ncols

        sheetname = sheet.name

        if not sheetname:
            sheetname = str(sheetCount)

        outputDict[sheetname] = []

        for row in range(1, number_of_rows):
            sublist = [sheet.cell_value(row, col) for col in range(0, number_of_columns)]

            if sublist not in outputDict[sheetname]:
                outputDict[sheetname].append(sublist)

            print outputDict[sheetname]

        sheetCount += 1

    return outputDict

import csv

# will go through the generated dictionary and write the data to csv files
def writeToFiles(generatedDictionary):

    for key in generatedDictionary:
        # "wb" is the right mode for the csv module on Python 2
        with open(key + ".csv", "wb") as csvFile:
            writer = csv.writer(csvFile)
            writer.writerows(generatedDictionary[key])
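To show the writing step in isolation, here is a minimal standalone sketch with a hypothetical one-sheet dictionary; it uses Python 3's `newline=""` idiom (on Python 2 you would open the file in `"wb"` mode instead):

```python
import csv

# Hypothetical de-duplicated output keyed by sheet name, as produced
# by a validateExcel-style function.
generatedDictionary = {"Sheet1": [[200, 201], [200, 202], [200, 203]]}

for key in generatedDictionary:
    # On Python 3, newline="" stops the csv module adding blank lines;
    # on Python 2 you would open in "wb" mode instead.
    with open(key + ".csv", "w", newline="") as csvFile:
        writer = csv.writer(csvFile)
        writer.writerows(generatedDictionary[key])
```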

If you can use pandas, something like this could work (`pd.ExcelFile` exposes the sheet names and can parse each sheet into a DataFrame):

import pandas as pd

xl = pd.ExcelFile(filename)

for name in xl.sheet_names:

    sheetDataFrame = xl.parse(name)
    filtered = sheetDataFrame.drop_duplicates()

    filtered.to_csv(name + ".csv")
  • here where u are removing duplicates? – bob marti Mar 13 '17 at 09:48
  • @bobmarti so the duplicates should be filtered out of my list with the line `if sublist not in mylist: ...`, so if a row with the same values has already been appended to `mylist` it wont be put in there – RichSmith Mar 13 '17 at 09:49
  • when I use this code I got an error `unhashable type: 'list'` – bob marti Mar 13 '17 at 10:11
  • ok cool, it might be to do with the sheet names, what line causes it? – RichSmith Mar 13 '17 at 10:26
  • i am running a python project.so i cant identify the line number – bob marti Mar 13 '17 at 10:31
  • ok, print the sheet names and make sure there are names for the sheets, if there is just one sheet, use the first function the `unhashable type: list` error is because it is trying to use a list as a `key` in the dictionary, so I'm guessing the problem is something to do with the generation of the keys – RichSmith Mar 13 '17 at 10:36
  • It is only one sheet and I tried only the first case before the EDIT code – bob marti Mar 13 '17 at 11:01
  • Oops... Exception : unhashable type: 'list' Message : unhashable type: 'list' – bob marti Mar 13 '17 at 11:11
  • made an edit to the first function, see if that works any better? – RichSmith Mar 13 '17 at 11:18
  • just thought, if the `1. 2. 3. ` prefixes are actually in your first column in every row, you'll need to remove them or every row will be unique, maybe split the first column in each row and get the second part of the cell `str(sheet.cell_value(row, col)).split(". ")[1]` (assuming that all the cells are formatted the same way) – RichSmith Mar 13 '17 at 11:32
  • when i do the second case multiple sheets it occurs error.Exception : u'Sheet1' – bob marti Mar 14 '17 at 08:15

This worked for me, in Python 2.7:

def validateExcel(filename):
    xls = xlrd.open_workbook(filename)
    for sheet in xls.sheets():
        number_of_rows = sheet.nrows
        number_of_columns = sheet.ncols
        sheetname = sheet.name
        mylist = []
        for row in range(1, number_of_rows):
            mylist.append((sheet.cell_value(row, 0), sheet.cell_value(row, 1)))
        myset = sorted(set(mylist), key=mylist.index)
        return myset
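The `sorted(set(...), key=mylist.index)` trick keeps the first occurrence of each pair; a standalone sketch with the sample data from the question:

```python
# Pairs from the question's sample data; row 3 duplicates row 1.
mylist = [(200, 201), (200, 202), (200, 201), (200, 203)]

# set() removes duplicates; sorting by mylist.index restores the
# original order, since list.index returns the FIRST position of a value.
myset = sorted(set(mylist), key=mylist.index)
print(myset)  # [(200, 201), (200, 202), (200, 203)]
```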

Here is my solution: remove the duplicates and create a new file without them.

import xlsxwriter
import xlrd

def remove_duplicates():

    read_file = xlrd.open_workbook('Original.xlsx')
    write_file = xlsxwriter.Workbook('Removed_Duplicates.xlsx')

    for sheet in read_file.sheets():
        no_rows = sheet.nrows
        no_cols = sheet.ncols
        name = sheet.name
        gen_sheets = write_file.add_worksheet(name)
        line_list = []
        r = 0
        for row in range(0, no_rows):
            line_sublist = [sheet.cell(row, col).value for col in range(0, no_cols)]
            if line_sublist not in line_list:
                line_list.append(line_sublist)
                for col in range(0, no_cols):
                    gen_sheets.write(r, col, line_sublist[col])
                r = r + 1
    write_file.close()

We can do this using the Python pandas package.

To install this package: pip install pandas

Reference: https://pandas.pydata.org/docs/getting_started/install.html

Use drop_duplicates() without any arguments

import pandas as pd
data = pd.read_excel('your_excel_path_goes_here.xlsx')
#print(data)
data = data.drop_duplicates()  # drop_duplicates returns a new DataFrame

Use drop_duplicates() by using column name

import pandas as pd
data = pd.read_excel('your_excel_path_goes_here.xlsx')
#print(data)
data = data.drop_duplicates(subset=["YOUR_COLUMN_NAME_GOES_HERE"], keep="last")

keep="first" instructs pandas to keep the first occurrence and remove the later duplicate rows.

keep="last" instructs pandas to keep the last occurrence and remove the earlier duplicate rows.

Suppose we want to remove every row that appears more than once in the Excel sheet; then we can specify keep=False.
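Putting this together for the original task (two key columns, keep the first occurrence, write a CSV), a sketch with made-up data mirroring the question's example:

```python
import pandas as pd

# Made-up frame mirroring the question's two columns; in practice this
# would come from pd.read_excel('your_excel_path_goes_here.xlsx').
data = pd.DataFrame({"A": [200, 200, 200, 200],
                     "B": [201, 202, 201, 203]})

# keep="first" drops row 3, which repeats the (A, B) pair of row 1.
deduped = data.drop_duplicates(subset=["A", "B"], keep="first")

# Write the de-duplicated rows out as CSV, as the question asks.
deduped.to_csv("deduped.csv", index=False)
```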