4

I would like to turn the loop over the items in L in following code:

L = [1,2,5,2,1,1,3,4]
L_unique = []
for item in L:
    if item not in L_unique: 
        L_unique.append(item)

to a list comprehension like:

L_unique = [ item for item in L if item not in ???self??? ]

Is this possible in Python? And if it is possible, how can it be done?

Chris
  • 26,361
  • 5
  • 21
  • 42
Claudio
  • 7,474
  • 3
  • 18
  • 48
  • No, that's not possible. Use the regular loop. – jonrsharpe Sep 17 '22 at 07:31
  • No I don't think so, cause the object will be created and assigned after iterating the comprehension – ThePyGuy Sep 17 '22 at 07:31
  • Impossible, at least not now. But in this example, loops can be avoided: `L_unique = list(dict.fromkeys(L))` – Mechanic Pig Sep 17 '22 at 07:31
  • @ThePyGuy : I suppose that somewhere must exist an object storing the current state of the comprehension, so knowing how to address that object will allow to put the right descriptor in place of ???self??? Makes taking a look into the C-code sense trying to find it out or is there no such object at the level of Python objects with a name stored in some descriptor dictionary? – Claudio Sep 17 '22 at 07:43

5 Answers5

13

It's possible. Here's a hack that does it, but I wouldn't use this in practice as it's nasty and depends on implementation details that might change and I believe it's not thread-safe, either. Just to demonstrate that it's possible.

You're mostly right with your "somewhere must exist an object storing the current state of the comprehension" (although it doesn't necessarily have to be a Python list object, Python could store the elements some other way and create the list object only afterwards).

We can find the new list object in the objects tracked by garbage collection. Collect the IDs of lists before the comprehension's list is created, then look again and take the one that wasn't there before.

Demo:

import gc

L = [1,2,5,2,1,1,3,4]

L_unique = [
    item

    # the hack to get self
    for ids in [{id(o) for o in gc.get_objects() if type(o) is list}]
    for self in [o for o in gc.get_objects() if type(o) is list and id(o) not in ids]

    for item in L
    if item not in self
]

print(L_unique)

Output (Try it online!):

[1, 2, 5, 3, 4]

Worked both in the linked site's Python 3.8 pre-release version and in Python 3.10.2 somewhere else.

For an alternative with the exact style you asked, only replacing your ???self???, see Mechanic Pig's updated answer.

Kelly Bundy
  • 23,480
  • 7
  • 29
  • 65
  • Thanks for the very interesting insight into the internals of Python with which you have proven so many others wrong. By the way: isn't it so that garbage collection must update all references to objects so there is no risk that `self` suddenly won't point to the right object if garbage collection interfere with the list comprehension? The only risk I see is that there is no guarantee that this will work in future versions of Python, but ... it's not unusual that new Python versions make some previous code requiring an update. – Claudio Sep 17 '22 at 16:57
  • Hmm, sorry, I don't understand what you mean with "update all references" and "won't point to the right object". Maybe rephrase? The main risk **I** meant is that I rely on my two searches to happen right before and after the creation of the comprehension's list. That's not about GC but about how list comprehensions are executed. Let's look at the bytecode of a [simpler one](https://tio.run/##K6gsycjPM7YoKPr/PzO3IL@oRCEls5iLC0joAbGGenSVQlp@kUKFQmaeQnRSKpCdGgsWqQSLJKaVpBbFxqpr/v8PAA): First `before` gets loaded, then the empty list is built, then `after` gets loaded. But that could change. – Kelly Bundy Sep 18 '22 at 01:06
  • The garbage collection, if moving something in memory, updates the 'pointers' the identifier point to, so also this one for the created list will be updated and valid if garbage collection interfere with the list comprehension. Thanks for your further explanations. It's hard to find the right phrases the other can easily understand ... *"Lost in translation"* is an excellent example of a movie making a point about it. – Claudio Sep 18 '22 at 01:23
  • @Claudio Objects don't get moved in memory, neither by garbage collection nor by anything else. Thus there's also no need to update references. Where did you hear about that? – Kelly Bundy Sep 18 '22 at 02:02
  • I mean to learned that some decades ago (MS DOS) in relation to C and assembler where it was necessary to calculate absolute addresses based on the base address the program was given when it was started or use only relative addressing. With memory not GByte but KByte. Do modern OSs allow the garbage collection to leave any number of gaps when the memory is freed so there is no need for relocation? – Claudio Sep 18 '22 at 05:06
  • @Claudio I'm no expert in that area, but yes, I do think that "gaps" are ok. [`id`](https://docs.python.org/3/library/functions.html#id) gives you a number that's *"constant for this object during its lifetime"*, and in CPython, it's *"the address of the object in memory"*. Hence an object's address can't change. And `print(id(object))` [gives me](https://tio.run/##K6gsycjPM7YoKPr/v6AoM69EIzNFIz8pKzW5RFPz/38A) addresses like 140614797775488, which is at around 140 **tera**bytes. I'm sure that's not a physical address but just a virtual one. Virtual memory pages maybe ... – Kelly Bundy Sep 18 '22 at 14:57
  • ...get moved around in physical memory, but I think that's at a lower level than where Python and its garbage collector operate. – Kelly Bundy Sep 18 '22 at 14:57
  • There are two kinds of addresses in Python: long ones for own identifiers and short ones for Pythons own ones. TIo does not display the short ones for some reason, but try for example to run in Python `print(id(object)); print(id('x')); print(id(None))`. It will give: `9862016 140578478110704 9842304`. So TIo fooled you a bit about the id(object) ... – Claudio Sep 18 '22 at 15:26
  • @Claudio I get equally large addresses for all three, like [140553094029952 140553077934000 140553094035840](https://tio.run/##K6gsycjPM7YoKPr/v6AoM69EIzNFIz8pKzW5RFPTWgEupF6hjsL3y89L1dT8/x8A). Not sure what you mean with fooled me. If it did, it fooled *Python*. – Kelly Bundy Sep 18 '22 at 15:35
  • Run it in IDLE or from file to see what I mean. It's always that way. I was surprised to see in TIo that it returns only the long ones what is not correct. I have told you that TIo gives wrong results on it for some not clear to me reasons. – Claudio Sep 18 '22 at 15:43
  • @Claudio I don't have a functional PC at the moment, only a phone, so I use TIO. What makes you think the results are *wrong*? You got 140578478110704 for one value yourself. – Kelly Bundy Sep 18 '22 at 15:50
  • I have observed it already many times studying id()s. The Pythons builtins have short id()s, self-defined identifiers long ones. So I conclude that at runtime allocated objects get the long id()s and these initialized by Python before running the code the short ones. – Claudio Sep 18 '22 at 16:42
  • Maybe you're running a newer Python version and Python changed that at some point, or it's different operating systems putting code and its "embedded data" at the start of memory or not. I don't think TIO is somehow lying about it (or even able to). – Kelly Bundy Sep 18 '22 at 19:43
  • I have checked it with python3.9. and 3.6 and then with python2.7. It was always that way. – Claudio Sep 18 '22 at 20:30
12

List comprehension actually makes an anonymous function then call it, but the list built will not be stored in local variable dictionary, but in the stack maintained in Python, so few Python level operations can obtain this list (gc is a crazy but feasible choice. Sorry for my exaggeration before, the solution using gc is attached at the end):

>>> [locals().copy() for i in range(3)]
[{'.0': <range_iterator at 0x207eeaca730>, 'i': 0},    # does not contain the built list
 {'.0': <range_iterator at 0x207eeaca730>, 'i': 1},
 {'.0': <range_iterator at 0x207eeaca730>, 'i': 2}]
>>> dis('[i for i in iterable]')
  1           0 LOAD_CONST               0 (<code object <listcomp> at 0x00000211FEAFD000, file "<dis>", line 1>)
              2 LOAD_CONST               1 ('<listcomp>')
              4 MAKE_FUNCTION            0
              6 LOAD_NAME                0 (iterable)
              8 GET_ITER
             10 CALL_FUNCTION            1
             12 RETURN_VALUE

Disassembly of <code object <listcomp> at 0x00000211FEAFD000, file "<dis>", line 1>:
  1           0 BUILD_LIST               0    # build an empty list and push it onto the stack
              2 LOAD_FAST                0 (.0)
        >>    4 FOR_ITER                 4 (to 14)
              6 STORE_FAST               1 (i)
              8 LOAD_FAST                1 (i)
             10 LIST_APPEND              2     # get the built list through stack and index
             12 JUMP_ABSOLUTE            2 (to 4)
        >>   14 RETURN_VALUE

For the example you provided, you can use list(dict.fromkeys(L)) to get the same results in Python 3.7+. Here I use dict instead of set because dict can preserve the insertion order:

>>> list(dict.fromkeys(L))
[1, 2, 5, 3, 4]

According to @KellyBundy , the current method I have found is to use gc.get_objects, but this operation is very expensive (because it collects more than 1000 objects) and I can't determine its accuracy:

>>> [item for item in L if item not in gc.get_objects(0)[-1]]
[1, 2, 5, 3, 4]

Making operations cheaper through caching:

>>> lst = None
>>> [item for item in L if item not in (lst := gc.get_objects(0)[-1] if lst is None else lst)]
[1, 2, 5, 3, 4]
Mechanic Pig
  • 6,756
  • 3
  • 10
  • 31
  • 2
    dict.fromkeys() is superior to the set() solution suggested in other answers – DarkKnight Sep 17 '22 at 08:10
  • Which Python stack do you mean? This one which is controlled by `sys.setrecursionlimit()` or another one? How many different stacks are used in Python? – Claudio Sep 17 '22 at 08:21
  • @Claudio Yes, it is. – Mechanic Pig Sep 17 '22 at 08:25
  • @MechanicPig: I have set recursion limit down and made the list L large ... there is no RecursionError, so the list comprehension does not put its list on this stack, right? – Claudio Sep 17 '22 at 08:28
  • @Claudio The stack will only hold references to the list, not all elements of the list. – Mechanic Pig Sep 17 '22 at 08:29
  • What do you mean with "this list cannot be obtained in Python level operations anyway"? That you'd have to write C code? I've done this before, with Python code, finding the object in `gc`. – Kelly Bundy Sep 17 '22 at 13:14
  • @KellyBundy Well, this may be a bit exaggerated, and `gc` is indeed a possible method. – Mechanic Pig Sep 17 '22 at 13:17
  • @KellyBundy `gc.get_objects(0)[-1]` seems to be a method. Is there any cheaper operation? – Mechanic Pig Sep 17 '22 at 13:34
  • 1
    @MechanicPig That's already cheaper than the one I posted now :-) (at least it would be if you looked up the object once and stored it in a variable instead of calling `gc.get_objects(0)` repeatedly). I think something like that is what I did the first time I did it. But I'd trust it even less, as I believe there's a chance it triggered a collection and isn't at that exact spot. Anyway, I don't think it matters, as neither one should be used in practice anyway. – Kelly Bundy Sep 17 '22 at 13:45
  • Thanks for the insight into Python disassembly and the short version of the solution using caching you can check out online here: https://tio.run/##NU3RCsIwDHzvV@SxhSrOKYiwPxj@wCgDRzYjW9u12YNfX9uJhIPk7nLnP/xytr75kBIt3gWGaRCihQa6Sp/1NaPKU@uLEWKOnIWHsyjafrO0bliMxLjA6ALsC1logcbfYR0XQpbPe5OzjxNy755vHDjKk@oOlSnmolPcowHniIVQRvhAluW/S6X0BQ – Claudio Sep 17 '22 at 17:32
  • @Mechanic Pig are there any disadvantages by an explicit call to the gc with `gc.collect(0)` before the initialization of `L`? – cards Oct 29 '22 at 22:59
  • 1
    @cards The list has been initialized at this time, because the call to `gc.collect(0)` is after the execution of `BUILD_LIST`. There should be no surprises when using this list. However, it should be noted that `gc.collect(0)` is a very expensive operation, and there is no guarantee that `gc.collect(0)[-1]` will be able to obtain the objects built by the comprehension (It should be a coincidence here. If the structure of the comprehension is different, the method is likely to fail.). – Mechanic Pig Oct 30 '22 at 05:32
4

While what you're asking for is highly impractical with a list comprehension, as L_unique does not exist until after the list comprehension completes, you can use a set comprehension.

L = [1,2,5,2,1,1,3,4]
L_unique = {x for x in L}

This is flexible if you wish to apply some other function to x, but in this simple form, you're better off with just:

L = [1,2,5,2,1,1,3,4]
L_unique = set(L)

If needed, a set can be converted back to a list.

L = [1,2,5,2,1,1,3,4]
L_unique = list(set(L))

Using a set may change the order of the elements compared to the original list.

Chris
  • 26,361
  • 5
  • 21
  • 42
1

So everything has been changed :)

gc.get_objects(generation=None)

Returns a list of all objects tracked by the collector, excluding the list returned. If generation is not None, return only the objects tracked by the collector that are in that generation.

Changed in version 3.8: New generation parameter.

Raises an auditing event gc.get_objects with argument generation.

Without using gc

A simple way:

L = [1,2,5,2,1,1,3,4]
L_unique = []

# This returns just a list of None
_ = [L_unique.append(i) for i in L if i not in L_unique]
L_unique

Output:

[1, 2, 3, 4, 5]

Or you can use this:

L = [1,2,5,2,1,1,3,4]
list(set(L))

Output:

[1, 2, 3, 4, 5]
Shahab Rahnama
  • 982
  • 1
  • 7
  • 14
  • 2
    As you can see from the other given answers your statement *You cannot reference the list itself in a list comprehension.* turned out not to be true and your answer therefore obsolete. – Claudio Sep 17 '22 at 16:46
0

If I had to remove duplicate elements from a list, and I wanted to preserve the order of the elements in the list, I would use the unique_everseen() function from the more_itertools library:

>>> from more_itertools import unique_everseen
>>> L = [1,2,5,2,1,1,3,4]
>>> list(unique_everseen(L))
[1, 2, 5, 3, 4]

This is similar to the set() approach that several others have suggested, but guaranteed to preserved order.

Kale Kundert
  • 1,144
  • 6
  • 18