I'd like to represent an arbitrary list as two other lists. The first, call it values
, containing the unique elements in the original list, and the second, call it codes
, containing the index in values
of each element in the original list, in such a way that the original list could be reconstructed as
orig_list = [values[c] for c in codes]
(Note: this is similar to how pandas.Categorical
represents series)
I've created the function below to do this decomposition:
def decompose(x):
values = sorted(list(set(x)))
codes = [0 for _ in x]
for i, value in enumerate(values):
codes = [i if elem == value else code for elem, code in zip(x, codes)]
return values, codes
This works, but I would like to know if there is a better/more efficient way of achieving this (no double loop?), or if there's something in the standard library that could do this for me.
Update:
The answers below are great and a big improvement to my function. I've timed all that worked as intended:
test_list = [random.randint(1, 10) for _ in range(10000)]
functions = [decompose, decompose_boris1, decompose_boris2,
decompose_alexander, decompose_stuart1, decompose_stuart2,
decompose_dan1]
for f in functions:
print("-- " + f.__name__)
# test
values, codes = f(test_list)
decoded_list = [values[c] for c in codes]
if decoded_list == test_list:
print("Test passed")
%timeit f(test_list)
else:
print("Test failed")
Results:
-- decompose
Test passed
12.4 ms ± 269 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
-- decompose_boris1
Test passed
1.69 ms ± 21.9 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
-- decompose_boris2
Test passed
1.63 ms ± 18.6 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
-- decompose_alexander
Test passed
681 µs ± 2.15 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
-- decompose_stuart1
Test passed
1.7 ms ± 3.42 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
-- decompose_stuart2
Test passed
682 µs ± 5.98 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
-- decompose_dan1
Test passed
896 µs ± 19.5 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
I'm accepting Stuart's answer for being the simplest and one of the fastest.