
I have a csv with 120,000 rows in the following form:

ID Duplicate
1 65
2 67
4 12
4 53
4 101
12 4
12 53
101 ...

This list specifies a number of user IDs along with the users that are duplicates of each one. The way the list is made up now, I can't filter it in Excel, so I am trying to transform it into this outcome:

[1, 65]
[2, 67]
[4, 12, 53, 101]

Afterwards I would write the result to a new csv, deleting only list[0] from each element, so that I retain one user per "duplicate user block". In Excel I would then delete all the remaining user IDs.

However, to get to this point I have a few problems:

import csv

with open("contacts.csv", "rt") as f:
    reader = csv.reader(f, delimiter="\t")

    contacts = []
    for row in reader:
        if row[0] not in contacts:
            contacts.append(row[0])
        if row[1] not in contacts:
            position = contacts.index(row[0])
            contacts[position].append(row[1])

Of course I get the error "AttributeError: 'str' object has no attribute 'append'", as contacts[position] is a string. But how can I change the code so that I get a list for each block of duplicate contacts?

Thanks!

rStorms

2 Answers


There is an almost-one-liner in standard Python too:

import csv
from itertools import groupby

with open("contacts.csv", "rt") as f:
    reader = csv.reader(f, delimiter="\t")
    contacts = [[k] + [r[1] for r in g] for k, g in groupby(reader, key=lambda row: row[0])]

I also like the pandas solution, but it means learning a new API.
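One caveat worth noting: itertools.groupby only merges *consecutive* rows with equal keys, so the one-liner above assumes the csv is already sorted by the first column. A minimal sketch with made-up rows showing what happens otherwise, and the sort that fixes it:

```python
from itertools import groupby

# Made-up rows where the '4' entries are not contiguous
rows = [("4", "12"), ("4", "53"), ("1", "65"), ("4", "101")]

# Unsorted input: '4' ends up split across two groups
groups = [[k] + [r[1] for r in g] for k, g in groupby(rows, key=lambda r: r[0])]
print(groups)  # [['4', '12', '53'], ['1', '65'], ['4', '101']]

# Sorting by the key first merges each ID into a single group
rows.sort(key=lambda r: r[0])
groups = [[k] + [r[1] for r in g] for k, g in groupby(rows, key=lambda r: r[0])]
print(groups)  # [['1', '65'], ['4', '12', '53', '101']]
```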

Gribouillis
  • Thanks! Got the same problem as the solution before. The end goal is to retain one user from each "duplicate user block". In the output of your solution I for example get ['3001464', '3001465'], ['3001465', '3001464'] (real file!). If I delete the first element from each list to get only the duplicate ones, this wouldn't work ;) Any idea? In the list I have given above: [1, 65] [2, 67] [4, 12, 53, 101] I want to delete 1, 2 and 4, so I can tell Excel to filter all users 65, 67, 12, 53 and 101, thereby retaining one original non-duplicate user. – rStorms Jul 17 '17 at 19:04
  • @user2252633 You're looking for the *connected components of a graph*. There are threads about this such as this one https://stackoverflow.com/questions/10301000/python-connected-components . You can also use external libraries such as networkx or igraph that have built in functions for this. – Gribouillis Jul 17 '17 at 19:46
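Following the comment above, here is a stdlib-only sketch of the connected-components idea: treat each (ID, duplicate) row as an edge and merge groups with a small union-find. The `components` helper and its pair data are illustrative, built from the question's sample:

```python
def components(pairs):
    # Union-find over the node IDs seen in the pairs
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    # Union the two endpoints of every edge
    for a, b in pairs:
        parent[find(a)] = find(b)

    # Collect each node under its component's root
    comps = {}
    for node in list(parent):
        comps.setdefault(find(node), set()).add(node)
    return sorted(sorted(c) for c in comps.values())

pairs = [("1", "65"), ("2", "67"), ("4", "12"), ("4", "53"),
         ("4", "101"), ("12", "4"), ("12", "53")]
print(components(pairs))
# [['1', '65'], ['101', '12', '4', '53'], ['2', '67']]
```

Each inner list is one "duplicate user block"; keeping one element per list and deleting the rest matches the goal described in the question.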

This will work even if your csv file is not sorted and if you're missing a few entries:

import csv

with open('contacts.csv') as infile:
    data = {}
    for i,dup in csv.reader(infile):
        if i not in data:
            if dup in data:
                data[i] = data[dup]
                continue

            data[i] = set((i, dup))
            data[dup] = data[i]
            continue

        data[i].add(dup)

for _,dups in data.items():
    print(sorted(dups))

UPDATE: If you want to avoid printing multiple sets of duplicates:

for k,dups in list(data.items()):
    if k not in data: continue
    print(sorted(dups))
    for d in dups: data.pop(d, None)  # default avoids a KeyError for members that never became keys
inspectorG4dget
  • I get this error: Traceback (most recent call last): File "duplicate.py", line 5, in for i,dup in csv.reader(infile): ValueError: not enough values to unpack (expected 2, got 1). Probably something with the CSV file... I have a space between the two values of a row and then a new line. Any pointers? – rStorms Jul 17 '17 at 18:59
  • @user2252633: that seems like an error caused by an empty line at the end of your file. Deleting that should fix it. Or if your csv file stops being comma separated at some point, then that could also cause this error – inspectorG4dget Jul 17 '17 at 19:10
  • Hm, still not working, tried both things you mentioned. I also saved the csv differently, so that I got a comma between each number: Contact No_,Duplicate Contact No_ 1000004,1000588 1000009,1000489 1000009,1000531 1000009,1000643 Only now I get: Traceback (most recent call last): File "duplicate.py", line 17, in for _,dups in data: ValueError: too many values to unpack (expected 2) – rStorms Jul 17 '17 at 19:23
  • @user2252633: sorry, that was a silly error on my part. It's fixed now – inspectorG4dget Jul 17 '17 at 20:20
  • Thank you for your help, but this is sadly not the output I need. I want all users that are duplicates of each other in one list. Those users can then never show up again in another list; otherwise I have no way of deleting one original user from each block and then afterwards deleting all the other users. – rStorms Jul 18 '17 at 13:27
  • I don't think this produces the connected components of the graph described by the csv file in all cases. I mean if you have `(i, j)` and `(j, k)` in the csv file, but not `(i, k)`. I don't think this will produce the set `(i, j, k)`. – Gribouillis Jul 18 '17 at 15:02
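To make the failure mode concrete: when a *new* first-column ID points at an already-known duplicate, the branch `data[i] = data[dup]` only aliases the new key to the existing set without adding the new ID to it. A minimal reproduction of the grouping logic from the answer (node names are illustrative):

```python
def build(rows):
    # Same dict-of-shared-sets logic as in the answer above
    data = {}
    for i, dup in rows:
        if i not in data:
            if dup in data:
                data[i] = data[dup]  # aliases the set, but never adds i!
                continue
            data[i] = set((i, dup))
            data[dup] = data[i]
            continue
        data[i].add(dup)
    return data

# ('i', 'j') creates {'i', 'j'}; ('k', 'j') only aliases data['k'] to that
# set, so the component ends up missing 'k'
data = build([('i', 'j'), ('k', 'j')])
print(data['j'])  # {'i', 'j'} -- 'k' was never added
```

A union-find or graph-library pass over the edges avoids this, since every endpoint of every edge is always recorded in its component.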