Wasn't too sure what all possible cases are judging from the question, for example what will constitute "missing"? Why isn't rows 2, 3, and 5 not listed despite the "Have" value not equaling the "Should Have" value?
Anyways, hope this serves as a useful guide. You can read more about CSV processing in the reference
import csv
with open("test.csv", 'r') as fd:
csv_reader = csv.reader(fd)
#Skip the header
next(csv_reader, None)
#init arrays
have, should_have = {}, {}
row_counter = 0
for row in csv_reader:
have.setdefault(float(row[0]),[]).append(row_counter)
try:
should_have.setdefault(float(row[1]),[]).append(row_counter)
except IndexError:
continue
row_counter += 1
extras = [(k,v) for k, v in have.items() if k not in should_have]
have_excess = []
for k, v in have.items():
if ((k,v) not in extras) and (len(v) > 1):
if (len(v) > len(should_have[k])):
have_excess.append((k,v))
missing = []
for k, v in should_have.items():
if k not in have:
missing.append((k,v))
elif (len(v) > len(have[k])):
missing.append((k,v))
File:
~$ cat test.csv
have , should-have
200 , 88.74
42.5 , 42.5
2.99 , 88.74
88.74 , 2.99
200 , 200
200 , 8.7
8.7 , 9.99
1000 , 200
200
88.74
2000, 88.74
Output:
>> print(extras)
# [(2000.0, [8]), (1000.0, [7])]
>> print(missing)
# [(88.74, [0, 2, 8]), (9.99, [6])]
>> print(duplicates)
# [(200.0, [0, 4, 5, 8])]
Some notes:
- The
dict.setdefault(key,[]).append(value)
method allows us to easily append (or create a new list) to a key, so that we can store multiple row-ids for each value (see this question)
- use
iteritems()
if you're using Python 2.x and items()
for Python 3.x
- I formatted the output lists with (key, value) tuples. The reason being is that I was not sure which row-ids you would like to keep/discard, so left them all in there! To index them
tupl = output_list[#]
will give you one tuple, and tupl[1]
or tupl[0]
will give you the value or list of row-ids respectively.