
I'm retrieving a list of (name, id) pairs and I need to make sure there are no duplicate names, regardless of the id.

# Sample data
filesID = [{'name': 'file1', 'id': '353'}, {'name': 'file2', 'id': '154'},
           {'name': 'file3', 'id': '1874'}, {'name': 'file1', 'id': '14'}]

I managed to get the desired output with nested loops:

uniqueFilesIDLoops = []
for pair in filesID:
    found = False
    for d in uniqueFilesIDLoops:
        if d['name'] == pair['name']:
            found = True
    if not found:
        uniqueFilesIDLoops.append(pair)

But I can't get it to work with a list comprehension. Here's what I've tried so far:

uniqueFilesIDComprehension = []
uniqueFilesIDComprehension = [
    pair for pair in filesID if pair['name'] not in [
        d['name'] for d in uniqueFilesIDComprehension
    ]
]

Outputs:

# Original data
[{'name': 'file1', 'id': '353'}, {'name': 'file2', 'id': '154'},
 {'name': 'file3', 'id': '1874'}, {'name': 'file1', 'id': '14'}]

# Data obtained with list comprehension
[{'name': 'file1', 'id': '353'}, {'name': 'file2', 'id': '154'},
 {'name': 'file3', 'id': '1874'}, {'name': 'file1', 'id': '14'}]

# Data obtained with loops (and desired output)
[{'name': 'file1', 'id': '353'}, {'name': 'file2', 'id': '154'},
 {'name': 'file3', 'id': '1874'}]

I suspect that the reference to uniqueFilesIDComprehension inside the list comprehension is not updated at each iteration, so it is still [] and never contains a matching name.

mkrieger1
Titouan L
  • Your thinking is exactly correct: until the whole list comprehension has finished, `uniqueFilesIDComprehension` is nothing other than `[]` – matszwecja Mar 03 '22 at 13:28
  • Thank you @matszwecja, that's a bummer. Do you think it's possible to get the output with something other than the nested for loops? – Titouan L Mar 03 '22 at 13:33
  • 1
    this can't be done in a single expression, except with a lot of (very advanced) code contortion. More importantly, if you need to check for duplicates, use a `set` or a `dict` with the names as keys: this will make your code thousands of times faster as soon as you are up to a few hundred names – jsbueno Mar 03 '22 at 13:34
  • 1
    `{el['name'] : el for el in filesID}.values()` should do the job. (This will keep the last item encountered in case of duplicates, not the first) – matszwecja Mar 03 '22 at 13:37
  • @matszwecja that's it! I've cast this to a `list` and got the expected output. Thank you very much, you can turn your comments into a complete answer if you want, and I'll accept it. As a side note, it takes the last item of each duplicate, unlike the for loops, but like I said, the `id` doesn't matter in my case. – Titouan L Mar 03 '22 at 13:41

3 Answers

1

I would stick with your original loop, although note that it can be made a little cleaner. Namely, you don't need a flag named found.

uniqueFilesIDLoops = []
for pair in filesID:
    for d in uniqueFilesIDLoops:
        if d['name'] == pair['name']:
            break
    else:
        uniqueFilesIDLoops.append(pair)
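
As a possible alternative to the for/else form, the inner search can be sketched with the built-in any(), which also stops at the first match. The variable name unique is only illustrative, and the sample data is copied from the question.

```python
filesID = [{'name': 'file1', 'id': '353'}, {'name': 'file2', 'id': '154'},
           {'name': 'file3', 'id': '1874'}, {'name': 'file1', 'id': '14'}]

unique = []  # illustrative name, not from the original post
for pair in filesID:
    # any() short-circuits on the first matching name,
    # playing the same role as `break` in the for/else version.
    if not any(d['name'] == pair['name'] for d in unique):
        unique.append(pair)

print(unique)
```

This keeps the first pair seen for each name, as the nested loops do, and it still scans the result list once per input element.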

You can also use an auxiliary set to simplify detecting duplicate names (since they are str values and therefore hashable).

seen = set()
uniqueFilesIDLoops = []
for pair in filesID:
    if (name := pair['name']) not in seen:
        seen.add(name)
        uniqueFilesIDLoops.append(pair)

Because we've now decoupled the result from the data structure we perform lookups in, the above could be turned into a list comprehension by writing an expression that both returns True when the name is not in the set and adds the name to the set. Something iffy like

seen = set()
uniqueFilesIDLoops = [pair 
                      for pair in filesID
                      if (pair['name'] not in seen
                          and (seen.add(pair['name']) or True))]

(seen.add always returns None, which is a falsey value, so seen.add(...) or True is always True.)
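
If the side effect inside the comprehension feels too iffy, one option is to move the set bookkeeping into a small generator. This is only a sketch: unique_by and its key parameter are hypothetical names, loosely modeled on the unique_everseen recipe in the itertools documentation.

```python
def unique_by(items, key):
    """Yield each item whose key(item) has not been seen before."""
    seen = set()
    for item in items:
        k = key(item)
        if k not in seen:
            seen.add(k)
            yield item


filesID = [{'name': 'file1', 'id': '353'}, {'name': 'file2', 'id': '154'},
           {'name': 'file3', 'id': '1874'}, {'name': 'file1', 'id': '14'}]

# Keeps the first pair encountered for each name.
uniqueFilesIDLoops = list(unique_by(filesID, key=lambda d: d['name']))
print(uniqueFilesIDLoops)
```

The call site stays a single expression, while the mutation of seen is confined to the helper.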

chepner
1

You cannot access the contents of a list comprehension while it is being built, because the result is bound to a name only after the whole expression has been evaluated.
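
A small experiment can make this visible. The sketch below records the length of the list that the name refers to each time the filter runs; not_seen and observed_lengths are hypothetical names introduced only for this illustration.

```python
filesID = [{'name': 'file1', 'id': '353'}, {'name': 'file2', 'id': '154'},
           {'name': 'file3', 'id': '1874'}, {'name': 'file1', 'id': '14'}]

unique = []
observed_lengths = []  # what len(unique) was at each filter call


def not_seen(pair):
    # `unique` is looked up on every call, but it still refers to the
    # empty list, because the comprehension's result is not bound yet.
    observed_lengths.append(len(unique))
    return pair['name'] not in [d['name'] for d in unique]


unique = [pair for pair in filesID if not_seen(pair)]

print(observed_lengths)  # [0, 0, 0, 0]
print(len(unique))       # 4
```

Every check runs against the empty list, so no pair is ever filtered out.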

The simplest way to remove duplicates would be:

list({el['name']: el for el in filesID}.values())

This builds a dictionary keyed by each element's name, so every time a duplicate name is encountered, the earlier element is overwritten by the new one. Once the dict is built, all you need to do is take its values and convert them to a list. If you want to keep the first element with each name rather than the last, you can instead build the dictionary in a for loop:

out = {}
for el in filesID:
    if el['name'] not in out:
        out[el['name']] = el
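
To compare the two dictionary-based variants side by side, here is a minimal sketch on the question's sample data. The names keep_last, first_by_name and keep_first are illustrative, and dict.setdefault is used as a compact stand-in for the explicit membership test.

```python
filesID = [{'name': 'file1', 'id': '353'}, {'name': 'file2', 'id': '154'},
           {'name': 'file3', 'id': '1874'}, {'name': 'file1', 'id': '14'}]

# Dict comprehension: a later duplicate overwrites the earlier value.
keep_last = list({el['name']: el for el in filesID}.values())

# setdefault only stores the element if the name is not a key yet,
# so the first occurrence of each name is kept.
first_by_name = {}
for el in filesID:
    first_by_name.setdefault(el['name'], el)
keep_first = list(first_by_name.values())

print([el['id'] for el in keep_last])   # ['14', '154', '1874']
print([el['id'] for el in keep_first])  # ['353', '154', '1874']
```

In both cases the names come out in order of first appearance, since dicts preserve insertion order (Python 3.7+); only the id kept for file1 differs.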

And finally, one thing to consider when implementing any of these solutions: since you do not care about the id part, do you really need to extract it?

I'd ask myself whether collecting just the names is a valid choice as well.

out = {el['name'] for el in filesID}
print(out)

Output (set ordering may vary): {'file1', 'file3', 'file2'}

matszwecja
  • The `id` matters to me, even if I don't care which one, since multiple ids can target the same file. My question about this operation stems precisely from this peculiarity. – Titouan L Mar 03 '22 at 14:09
0

List comprehensions are used to create new lists, so the original list is never updated; the assignment causes the variable to refer to the newly created list.

Scott Hunter