Python: How to find duplicates from two CSV columns, and any extra or 'missing' values

Question

Basically we have these two columns, and we want to get the 'Have' column to match the 'Should Have' column. I want the code to create a new column showing the duplicates or extras we have in 'Have' and another column showing the amounts that are in 'Should Have' but are missing from 'Have'.

I'm not sure how to get it to spot duplicates.

File: test.csv

Post the CSV as text in your question. Also, [what have you tried so far](http://stackoverflow.com/help/how-to-ask)? — aneroid, Jan 25 '17 at 02:15
I found a few similar questions to this one that compared two csvs but they didn't account for duplicates. In the most helpful code I found so far, the first column in 'hosts.csv' was compared to first column in 'masterlist.csv', and a 'results.csv' was created, saying if each row in hosts was found somewhere in masterlist. However, I also need to know how many times it was found in masterlist so I can get rid of any extras, or 'duplicates'. If the amount '2.75' was found in the master column 12 times , for example, it should be in the other column 12 times as well — Paul Marinos, Jan 25 '17 at 02:23
I'm working with hundreds of lines in each column. I want to make a code that instantly shows me what is in column1 but not in column2, and the other way around, and the amount of time they're in each column is important — Paul Marinos, Jan 25 '17 at 02:26

score 0 · Accepted Answer · edited May 23 '17 at 10:30

0

Wasn't too sure what all possible cases are judging from the question, for example what will constitute "missing"? Why isn't rows 2, 3, and 5 not listed despite the "Have" value not equaling the "Should Have" value?

Anyways, hope this serves as a useful guide. You can read more about CSV processing in the reference

import csv

with open("test.csv", 'r') as fd:
    csv_reader = csv.reader(fd)

    #Skip the header
    next(csv_reader, None)

    #init arrays
    have, should_have = {}, {}

    row_counter = 0
    for row in csv_reader:

        have.setdefault(float(row[0]),[]).append(row_counter)

        try:
            should_have.setdefault(float(row[1]),[]).append(row_counter)
        except IndexError:
            continue

        row_counter += 1

extras = [(k,v) for k, v in have.items() if k not in should_have]

have_excess = []
for k, v in have.items():
    if ((k,v) not in extras) and (len(v) > 1):

        if (len(v) > len(should_have[k])):
            have_excess.append((k,v))

missing = []
for k, v in should_have.items():

    if k not in have:
        missing.append((k,v))
    elif (len(v) > len(have[k])):
        missing.append((k,v))

File:

~$ cat test.csv
    have  , should-have
    200   , 88.74
    42.5  , 42.5
    2.99  , 88.74
    88.74 , 2.99
    200   , 200
    200   , 8.7
    8.7   , 9.99
    1000  , 200
    200
    88.74
    2000, 88.74

Output:

>> print(extras)
# [(2000.0, [8]), (1000.0, [7])]

>> print(missing)
# [(88.74, [0, 2, 8]), (9.99, [6])]

>> print(duplicates)
# [(200.0, [0, 4, 5, 8])]

Some notes:

The dict.setdefault(key,[]).append(value) method allows us to easily append (or create a new list) to a key, so that we can store multiple row-ids for each value (see this question)
use iteritems() if you're using Python 2.x and items() for Python 3.x
I formatted the output lists with (key, value) tuples. The reason being is that I was not sure which row-ids you would like to keep/discard, so left them all in there! To index them tupl = output_list[#] will give you one tuple, and tupl[1] or tupl[0] will give you the value or list of row-ids respectively.

edited May 23 '17 at 10:30

Community

1
1

answered Jan 25 '17 at 02:38

Jamie Phan

1,112
7
15

Thanks for the contribution. Row 2 has the amount 2.99 in 'Have', and there is also a 2.99 in 'Should Have' so those values are good. They "cross each other off" if you will. Same thing for the 88.74's. We have the value 200 twice in the Have column, but its only there once in the Should Have column, so there is one extra, or duplicate 200. Hope this clears it up – Paul Marinos Jan 25 '17 at 02:44
The values do not have to be adjacent to each other. They only need to show up the same exact number of times in column 1 as column 2! – Paul Marinos Jan 25 '17 at 02:46
Okay, understood! Edited to fix the missing values (use `set` if you want to find unique values in a list). Not personally happy with the solution for duplicates, hopefully you can find a more elegant way! – Jamie Phan Jan 25 '17 at 02:49
Sorry you can only subtract sets from each other. Fixed. – Jamie Phan Jan 25 '17 at 02:53
Thanks for the help, and I will work on making it more elegant, I'm still in the early stages for this project and I want to put more time into it. Also I am new to python so I appreciate all the help. – Paul Marinos Jan 25 '17 at 03:00
Fixed! Sorry for giving such a poor answer at first, did one roughly in my lunch break and had a proper look at it. Answer should work now. (except the duplicates will also return the first occurrence of the value – Jamie Phan Jan 25 '17 at 03:23
line 7, in next(csv_reader, None) _csv.Error: new-line character seen in unquoted field - do you need to open the file in universal-newline mode? – Paul Marinos Jan 25 '17 at 03:28
Sorry I'm not too sure without being able to see your CSV file, please see [this question](http://stackoverflow.com/questions/17315635/csv-new-line-character-seen-in-unquoted-field-error) for more details – Jamie Phan Jan 25 '17 at 03:31
Edit. Duplicate list no longer shows first occurrence of duplicated value. – Jamie Phan Jan 25 '17 at 03:44
Trying to clear up what I consider to be 'excess' / 'duplicate' / 'extra' values: The Should Have Column is the 'Correct' list of values, so we are comparing the 'Have' column to the Should Have. If the Should Have has the value 28.50 appear eight times, but the 'Have' column has it appear ten times, then I'd like to have 28.50 appear two times in the newly created 'Duplicate / Excess' column. Also, If the 'Should Have' has 28.50 appear eight times but the "Have" has it appear only six times, then there should be two 28.50s appear in a newly created "Missing" Column. – Paul Marinos Jan 25 '17 at 04:29
My CSV contains 2 columns, Have and Should Have with values under those headers. I'm looking into trying to get past this 'universal code' error so that I can actually run the code! – Paul Marinos Jan 25 '17 at 04:30
Understood - I'll update the `missing` list to do the functionality as you requested. As for the duplicates functionality the `have_excess` should do what you want (you should easily be able to merge the duplicates/excess lists!) Secondly, from your latter comment, it would be really helpful if you could show your CSV file, something like I have, as your initial picture shows 3 columns which I coded to accordingly – Jamie Phan Jan 25 '17 at 04:32
You are great thanks for the help. I think the miscommunication of my file was due to my lack of experience in formatting my questions here on stack overflow. I did not mean for there to be 3 columns in the original picture, but the first numbers are just the cell count in excell. I apologise, adding a screenshot of my file – Paul Marinos Jan 25 '17 at 04:48
No worries! Updated the answer. The current solution shows all row-ids of the values - you can modify the output to show 'how many have values excess' etc. I'll let you decide on what output you want! – Jamie Phan Jan 25 '17 at 04:57

Python: How to find duplicates from two CSV columns, and any extra or 'missing' values

1 Answers1