Encode Python lists as indexes of unique values

Question

I'd like to represent an arbitrary list as two other lists. The first, call it values, containing the unique elements in the original list, and the second, call it codes, containing the index in values of each element in the original list, in such a way that the original list could be reconstructed as

orig_list = [values[c] for c in codes]

(Note: this is similar to how pandas.Categorical represents series)

I've created the function below to do this decomposition:

def decompose(x):
    values = sorted(list(set(x)))
    codes = [0 for _ in x]
    for i, value in enumerate(values):
        codes = [i if elem == value else code for elem, code in zip(x, codes)]
    return values, codes

This works, but I would like to know if there is a better/more efficient way of achieving this (no double loop?), or if there's something in the standard library that could do this for me.

Update:

The answers below are great and a big improvement to my function. I've timed all that worked as intended:

test_list = [random.randint(1, 10) for _ in range(10000)]
functions = [decompose, decompose_boris1, decompose_boris2,
             decompose_alexander, decompose_stuart1, decompose_stuart2,
             decompose_dan1]
for f in functions:
    print("-- " + f.__name__)
    # test
    values, codes = f(test_list)
    decoded_list = [values[c] for c in codes]
    if decoded_list == test_list:
        print("Test passed")
        %timeit f(test_list)
    else:
        print("Test failed")

Results:

-- decompose
Test passed
12.4 ms ± 269 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
-- decompose_boris1
Test passed
1.69 ms ± 21.9 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
-- decompose_boris2
Test passed
1.63 ms ± 18.6 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
-- decompose_alexander
Test passed
681 µs ± 2.15 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
-- decompose_stuart1
Test passed
1.7 ms ± 3.42 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
-- decompose_stuart2
Test passed
682 µs ± 5.98 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
-- decompose_dan1
Test passed
896 µs ± 19.5 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

I'm accepting Stuart's answer for being the simplest and one of the fastest.

I'm surprised you chose the least efficient answer as the accepted answer. — Boris Verkhovskiy, Dec 04 '19 at 07:33
@Boris, could you elaborate? The least efficient in which sense? I'm happy to change the selection as I'd like people with the same problem to choose the best solution, but I'd like to have a good criterion. — foglerit, Dec 04 '19 at 20:07
@foglerit try doing `decompose(list(range(100000)))` using the functions defined in each of the answers. The one you chose as the accepted answer takes like 30 seconds (and every extra element causes the code to iterate over the entire array, the amount of time it takes grows n^2 with the input), the three answers that have 1 upvote all execute in less than a second. — Boris Verkhovskiy, Dec 04 '19 at 20:12
The reason this happens is because of `[values.index(v) for v in x]`. `values.index` looks for the element by iterating over the entire array. In the worst case (every element is unique) it'll iterate over the entire array as many times as there are items in the array (well technically it'll iterate half that amount, because `index` returns as soon as it finds an element and doesn't bother iterating over the rest of the array). Whereas all the answers that use a dictionary find the index right away, so they only iterate over the array once or twice in total. — Boris Verkhovskiy, Dec 04 '19 at 20:28

score 1 · Answer 1 · edited Jun 20 '20 at 09:12

I’m quite happy with this solution, although I am still trying to find a better one.

Code

def decompose(original_list: List[Any]) -> Tuple[List[int], Dict[int, Any]]: 
    code_to_elem = dict(enumerate(set(original_list)))
    elem_to_code = {v: k for k, v in code_to_elem.items()}
    encoded_list = [elem_to_code[elem] for elem in original_list]
    return encoded_list, code_to_elem

Test run

# t_list for test_list
t_list = [1, 2, 19, 3, 2, 19, 2, 3, 19, 1, 1, 3]

t_encoded, t_decoder = decompose(t_list)

t_decoded = [t_decoder[curr_code] for curr_code in t_encoded]

Here are the contents of the important variables:

t_list: [1, 2, 19, 3, 2, 19, 2, 3, 19, 1, 1, 3]
t_encoded: [1, 2, 3, 0, 2, 3, 2, 0, 3, 1, 1, 0]
t_decoder: {0: 3, 1: 1, 2: 2, 3: 19}
t_decoded: [1, 2, 19, 3, 2, 19, 2, 3, 19, 1, 1, 3]

Let me know if you have any questions :)

Dan D. · Answer 2 · 2019-12-03T14:31:58.513

This would count as an answer even if it is merely an improvement on Boris's answer.

I would use index_of_values.append(values.setdefault(elem, len(values))) as the loop body as that reduces three dict lookups to one and keeps the branch outside the interpreter. One might even create locals for the two methods to not repeatedly do lookups for them. But it seems that the savings of doing both is only 7%.

But using the insane looking values = defaultdict(lambda: len(values)) gives a 23%.

from collections import defaultdict

def decompose(x):
    values = defaultdict(lambda: len(values))
    index_of_values = []
    _append = index_of_values.append
    for elem in x:
        _append(values[elem])
    return list(values), index_of_values

It is even better if the loop is replaced by a map:

def decompose(x):
    values = defaultdict(lambda: len(values))
    index_of_values = list(map(values.__getitem__, x))
    return list(values), index_of_values

Gives 57%. I would have caught that if I had been looking at the output of the function. Also get evidently doesn't trigger the factory. I don't know why it doesn't.

If the dict does not retain insertion order:

return sorted(values, key=values.get), index_of_values

The version with map will give Nones because defaultdict won't call lambda you defined in it's constructor — marke, Dec 03 '19 at 09:27
`list(map(values.get, x))` should be changed to `list(map(values.__getitem__, x))` — marke, Dec 03 '19 at 10:27

score 0 · Answer 3 · answered Dec 03 '19 at 03:50

You can do this, IIUC, with cummin:

df['newgroup'] = df.reset_index().groupby('group')['index'].cummin()                                                                                                                            

In [1579]: df                                                                                                                                                                                              
Out[1579]: 
    group  newgroup
0       5         0
1       4         1
2       5         0
3       6         3
4       7         4
5       8         5
6       5         0
7       3         7
8       2         8
9       5         0
10      6         3
11      7         4
12      8         5
13      8         5
14      5         0

Stuart · Accepted Answer · 2019-12-04T14:11:42.500

0

You can use a simple index lookup:

def decompose(x):
    values = sorted(set(x))
    return values, [values.index(v) for v in x]

If more time-efficiency is needed (because x is very large) then this can be achieved (in exchange for some memory overhead) by representing values as a dictionary:

def decompose(x):
    values = sorted(set(x))
    d = {value: index for index, value in enumerate(values)}
    return values, [d[v] for v in x]

If sorting is not needed (or not possible for some reason) then replace sorted with list in the above.

edited Dec 04 '19 at 14:11

answered Dec 03 '19 at 05:15

Stuart

9,597
1
21
30

Isn't this O(n^2) (where n is number of unique elements)? – Boris Verkhovskiy Dec 03 '19 at 05:39
@Boris Wouldn’t it be O(n + n * (number of unique elems))? This is tougher to figure out than I thought it would be xd – AMC Dec 03 '19 at 05:43
Does sorting the values provide any benefit? – AMC Dec 03 '19 at 05:48
You should just replace the first snippet with the second and not mention the n^2 answer. – Boris Verkhovskiy Dec 03 '19 at 10:07
@Boris the first one uses less memory so might sometimes be preferred. – Stuart Dec 03 '19 at 12:43
1

If you have so many elements that you can't create a second copy of your list, there is no way the n² code will run in any reasonable amount of time. – Boris Verkhovskiy Dec 03 '19 at 12:46
Earlier I asked why you were sorting the values, I now realize why it caught my attention and bugged me. You’re placing restrictions (some pretty big ones, IMO) on the input: It must be possible to not only check if two elements are equal, but if one is greater or lesser than the other! That means, for example, that the list can only contain one of numbers, strings, or tuples. – AMC Dec 04 '19 at 06:40
@Boris Is it actually n^2 in the end? – AMC Dec 04 '19 at 06:41
@AlexanderCécile usually you don't get that pedantic with complexity analysis. You wouldn't mention the "n +" because that's just n*(n+1), the +1 doesn't matter, you care about the shape of the line. You also didn't consider the fact that `.index` does early stoping, so you forgot to divide by two (or something like that). "n * k" where n is the total number of elements and k is the number of unique elements would be more correct. In the worst case all the elements are unique, in which case it's just n*n so I just call it n^2. – Boris Verkhovskiy Dec 04 '19 at 07:02
@AlexanderCécile By default, objects can be sorted in Python (`sorted([lambda x: x*x, map, [1,2,3], {"a": 3}, object])` works) so the input is not restricted to numbers, strings or tuples only. But yes it is possible in principle there could be an unsortable object, in which case you should use a version that doesn't sort. – Stuart Dec 04 '19 at 14:10
@Stuart _print(sorted([lambda x: x*x, map, [1,2,3], {"a": 3}, object])) TypeError: '<' not supported between instances of 'type' and 'function'_ ? I struggled to find a way to explain it concisely, but what I meant is that you can’t have a list with numbers, strings and tuples. – AMC Dec 04 '19 at 14:14
As the OP's time tests show, the first function is just as fast as some of the more 'optimized' algorithms for `n=10000`, while the 2nd function is faster. That is because the overhead of creating an extra dictionary etc. in Python often outweighs the speed benefit of a less time-complex algorithm for a large range of likely inputs. [Don't optimize prematurely](https://stackoverflow.com/questions/385506/when-is-optimisation-premature), and if you need to optimize do it for real world inputs that the function will actually have to handle. – Stuart Dec 04 '19 at 14:18
@AlexanderCécile Ah, it works in Python 2 but not 3. Python 3 is a lot more restrictive about comparing different types, or (say) comparing 2 dictionaries. So yes, avoid sorting if you have that kind of input. – Stuart Dec 04 '19 at 14:26

Boris Verkhovskiy · Answer 5 · 2019-12-03T06:36:10.210

Create a dictionary where the keys are the unique values and they map to their index in the keys (dictionaries keep order starting with CPython 3.6). You do this by iterating over the list, if an element is not in the dictionary, you add it to the dictionary and map it to the length of the dictionary at the time you added it. Then you look up the element's index in the dictionary and append it to the list. Then you return just the keys, along with the list of indexes.

def decompose(x):
    values = {}
    index_of_values = [values.setdefault(elem, len(values)) for elem in x]
    return list(values), index_of_values

This is linear time and space complexity. Use it like this:

>>> decompose([2, 1, 1, 1, 131, 42, 2])
([2, 1, 131, 42], [0, 1, 1, 1, 2, 3, 0])

Using the side effect of a list comprehension is generally frowned upon, so you might want to write it out this function more explicitly:

def decompose(x):
    values = {}
    index_of_values = []
    for elem in x:
        if elem not in values:
            values[elem] = len(values)
        index_of_values.append(values[elem])
    return list(values), index_of_values

Govinda Malavipathirana · Answer 6 · 2019-12-03T05:29:06.253

-1

If you need some thing like pandas.Caterogical

arbi_arr=[1, 2, 3, 1, 2, 3]

value=list(dict.fromkey(arbi_arr))
code=list(range(0, len(arbi_arr)))

edited Dec 03 '19 at 05:29

answered Dec 03 '19 at 04:38

Govinda Malavipathirana

1,095
2
11
29

Range takes an integer, you’re giving it a list. The default start value is 0, so you can get rid of that first parameter. There’s a typo, it’s `fromkeyS` not `fromkey`. I fixed all those and ran the code, the result doesn’t makes much sense to me. I guessed that the range was meant to be `range(0, len(arbi_arr))`, maybe I was wrong. – AMC Dec 03 '19 at 04:52
Yes that was mistake, Range dose take the `len(arbi_arr))`. – Govinda Malavipathirana Dec 03 '19 at 05:31

Encode Python lists as indexes of unique values

6 Answers6

Code

Test run