0

Two columns ("Name" & "Value") in excel.

There are duplicates (eg. "xxa","xxf") in the Value column and the python script needs to find what are the duplicates cell values and put them into an array

enter image description here

The output should be

{
  "xxa": ["aaa","bbb","ccc","hhh"],
  "xxf": ["fff","jjj"]
}

How to improve the current script?

file = open('columnData.csv')
csvreader = csv.reader(file)
next(csvreader)

for row in csvreader:
    name = row[0]
    value = row[1]
    value_col.append(value)
    name_value_col.append(name+","+value)
file.close()
count={}
names=[]

for item in value_col:
    if value_col.count(item)>1:
        count[item]=value_col.count(item)

for name,value in count.items():
    names.append(name) 
total=[]

for item in name_value_col:
    item_name=item.split(",")     
    if item_name[1] in names:  
        total.append(item_name[0])
print(total)
kd88
  • 1,054
  • 10
  • 21
liuxu
  • 57
  • 4

1 Answers1

2

I'd recommend using defaultdict, and while you're at it using csv.DictReader makes for more legible code:

import csv
from collections import defaultdict

data = defaultdict(list)
with open('columnData.csv') as f:
    reader = csv.DictReader(f)
    for row in reader:
        data[row['Value']].append(row['Name'])

and then regarding duplicate finding you can EITHER take the destructive approach (pruning non-duplicates)

# Remove non-duplicates here
for key in list(data.keys()):  # note need to take a copy of the keys
    if len(data[key]) == 1:  # only one value in the list
        del data[key]

print(dict(data))

>>> {"xxa": ["aaa","bbb","ccc","hhh"], "xxf": ["fff","jjj"]}

or if you prefer a non-destructive approach to finding duplicates:

def _filter_duplicates(data):
    for key, value in data.items():
        if len(value) > 1:
            yield key, value

def find_duplicates(data):
    return dict(_filter_duplicates(data))

print(find_duplicates(data))

>>> {"xxa": ["aaa","bbb","ccc","hhh"], "xxf": ["fff","jjj"]}
kd88
  • 1,054
  • 10
  • 21
  • Clever solution! Is there something missing from this code? Where are the entries for "xxe", "xxi" or "xxd" for example? – eandklahn Dec 07 '21 at 13:56
  • 1
    Ah yes I forgot to remove non-dupes after the fact - which I've added in now! Although perhaps the OP would prefer to omit this with hindsight - I'll leave that up to them! – kd88 Dec 07 '21 at 13:57
  • Yes. Now, this is a more purist comment, but you don't really need the `list(data.keys())`, do you? I would think that `data.keys()` is enough. – eandklahn Dec 07 '21 at 14:21
  • 1
    Unfortunately that isn't permitted, since the dictionary would change size during iteration (which would throw the`.keys()` iterator out of order). You need to create a copy of the keys, and `list` is one such way, although there may be a less costly way of doing so using generators – kd88 Dec 07 '21 at 14:34
  • For posterity, I've added in one such mechanism – kd88 Dec 07 '21 at 14:41