
Definition

factorize: map each unique object to a unique integer, typically in the range from zero to n - 1, where n is the number of unique objects. Two variations are typical as well: in Type 1 the numbering follows the order in which unique objects are first encountered, while in Type 2 the unique objects are first sorted and then numbered by the same process as in Type 1.
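For instance, the two variants can be sketched in plain Python (illustrative only, not the fast path asked about below):

```python
items = ['b', 'a', 'b', 'c']

# Type 1: number uniques in order of first appearance
seen = {}
type1 = [seen.setdefault(x, len(seen)) for x in items]
print(type1)  # [0, 1, 0, 2]

# Type 2: sort the uniques first, then number them
order = {v: i for i, v in enumerate(sorted(set(items)))}
type2 = [order[x] for x in items]
print(type2)  # [1, 0, 1, 2]
```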

The Setup
Consider the list of tuples tups

tups = [(1, 2), ('a', 'b'), (3, 4), ('c', 5), (6, 'd'), ('a', 'b'), (3, 4)]

I want to factorize this into

[0, 1, 2, 3, 4, 1, 2]

I know there are many ways to do this. However, I want to do this as efficiently as possible.


What I've tried

pandas.factorize and get an error...

pd.factorize(tups)[0]

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-84-c84947ac948c> in <module>()
----> 1 pd.factorize(tups)[0]

//anaconda/envs/3.6/lib/python3.6/site-packages/pandas/core/algorithms.py in factorize(values, sort, order, na_sentinel, size_hint)
    553     uniques = vec_klass()
    554     check_nulls = not is_integer_dtype(original)
--> 555     labels = table.get_labels(values, uniques, 0, na_sentinel, check_nulls)
    556 
    557     labels = _ensure_platform_int(labels)

pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_labels (pandas/_libs/hashtable.c:21804)()

ValueError: Buffer has wrong number of dimensions (expected 1, got 2)

Or numpy.unique and get an incorrect result...

np.unique(tups, return_inverse=1)[1]

array([0, 1, 6, 7, 2, 3, 8, 4, 5, 9, 6, 7, 2, 3])

I could use either of these on the hashes of the tuples

pd.factorize([hash(t) for t in tups])[0]

array([0, 1, 2, 3, 4, 1, 2])

Yay! That's what I wanted... so what's the problem?

First Problem
Look at the performance drop from this technique

lst = [10, 7, 4, 33, 1005, 7, 4]

%timeit pd.factorize(lst * 1000)[0]
1000 loops, best of 3: 506 µs per loop

%timeit pd.factorize([hash(i) for i in lst * 1000])[0]
1000 loops, best of 3: 937 µs per loop

Second Problem
Hashing is not guaranteed to be unique!
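(CPython makes this easy to see: `-1` is reserved as an error sentinel in the C-level hash protocol, so `hash(-1)` is remapped and collides with `hash(-2)`. An implementation detail, but it shows that distinct objects can share a hash and would then be wrongly merged into one label.)

```python
# In CPython, hash(-1) is internally remapped to -2, so these two
# distinct integers share a hash value
print(hash(-1), hash(-2))
print(hash(-1) == hash(-2))  # True in CPython
```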


Question
What is a super fast way to factorize a list of tuples?


Timing
both axes are in log space

[figure: average seconds per method vs. input size, both axes log scale]

Code

from itertools import count

def champ(tups):
    d = {}
    c = count()
    return np.array(
        [d[tup] if tup in d else d.setdefault(tup, next(c)) for tup in tups]
    )

def root(tups):
    return pd.Series(tups).factorize()[0]

def iobe(tups):
    return np.unique(tups, return_inverse=True, axis=0)[1]

def get_row_view(a):
    void_dt = np.dtype((np.void, a.dtype.itemsize * np.prod(a.shape[1:])))
    a = np.ascontiguousarray(a)
    return a.reshape(a.shape[0], -1).view(void_dt).ravel()

def diva(tups):
    return np.unique(get_row_view(np.array(tups)), return_inverse=1)[1]

def gdib(tups):
    return pd.factorize([str(t) for t in tups])[0]

from string import ascii_letters

def tups_creator_1(size, len_of_str=3, num_ints_to_choose_from=1000, seed=None):
    c = len_of_str
    n = num_ints_to_choose_from
    np.random.seed(seed)
    d = pd.DataFrame(np.random.choice(list(ascii_letters), (size, c))).sum(1).tolist()
    i = np.random.randint(n, size=size)
    return list(zip(d, i))

results = pd.DataFrame(
    index=pd.Index([100, 1000, 5000, 10000, 20000, 30000, 40000, 50000], name='Size'),
    columns=pd.Index('champ root iobe diva gdib'.split(), name='Method')
)

for i in results.index:
    tups = tups_creator_1(i, max(1, int(np.log10(i))), max(10, i // 10))
    for j in results.columns:
        stmt = '{}(tups)'.format(j)
        setup = 'from __main__ import {}, tups'.format(j)
        # set_value was removed in modern pandas; results.at[i, j] = ... is the replacement
        results.set_value(i, j, timeit(stmt, setup, number=100) / 100)

results.plot(title='Avg Seconds', logx=True, logy=True)
piRSquared
  • Would you have to maintain the order like that or something like : `[0, 3, 1, 4, 2, 3, 1]` would be okay too? – Divakar May 26 '17 at 18:23
  • @Divakar currently I don't care. You can choose which ever is more convenient. – piRSquared May 26 '17 at 18:30
  • I think we would need a better benchmark that has both strings and numbers, and of course should be big enough, with a number of duplicates consistent with those in the sample case. – Divakar May 26 '17 at 19:16
  • this is a bug on 0.20.1 fixed in 0.20.2 (not released yet) – Jeff May 26 '17 at 19:34
  • For us non-Pandas users, it would be nice if you described what you mean by `factorize`, especially if you want `numpy` based answers. – hpaulj May 26 '17 at 19:35
  • Ton of work already for @piRSquared :) – Divakar May 26 '17 at 19:40

7 Answers


A simple way to do it is use a dict to hold previous visits:

>>> d = {}
>>> [d.setdefault(tup, i) for i, tup in enumerate(tups)]
[0, 1, 2, 3, 4, 1, 2]
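(Note these labels are first-occurrence indices; they come out sequential here only by luck of the input. A duplicate appearing before a new item leaves a gap, e.g.:)

```python
# first-occurrence indices skip numbers once a duplicate precedes a new item
d = {}
labels = [d.setdefault(tup, i) for i, tup in enumerate([('a',), ('a',), ('b',)])]
print(labels)  # [0, 0, 2]
```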

If you need to keep the numbers sequential then a slight change:

>>> from itertools import count
>>> c = count()
>>> d = {}
>>> [d[tup] if tup in d else d.setdefault(tup, next(c)) for tup in tups]
[0, 1, 2, 3, 4, 1, 2]

Or alternatively written (with a fresh `d` and `c`; beware that `d.get(tup)` is falsy when the stored label is `0`, so the right-hand side re-evaluates and silently consumes an extra `next(c)`, which can leave gaps in the numbering):

>>> [d.get(tup) or d.setdefault(tup, next(c)) for tup in tups]
[0, 1, 2, 3, 4, 1, 2]
AChampion

Initialize your list of tuples as a Series, then call factorize:

pd.Series(tups).factorize()[0]

[0 1 2 3 4 1 2]
root

@AChampion's use of setdefault got me wondering whether defaultdict could be used for this problem. So cribbing freely from AC's answer:

In [189]: tups = [(1, 2), ('a', 'b'), (3, 4), ('c', 5), (6, 'd'), ('a', 'b'), (3, 4)]

In [190]: import collections
In [191]: import itertools
In [192]: cnt = itertools.count()
In [193]: dd = collections.defaultdict(lambda : next(cnt))

In [194]: [dd[t] for t in tups]
Out[194]: [0, 1, 2, 3, 4, 1, 2]

Timings in other SO questions show that defaultdict is somewhat slower than the direct use of setdefault. Still, the brevity of this approach is attractive.

In [196]: dd
Out[196]: 
defaultdict(<function __main__.<lambda>>,
            {(1, 2): 0, (3, 4): 2, ('a', 'b'): 1, (6, 'd'): 4, ('c', 5): 3})
hpaulj

Approach #1

Convert each tuple to a row of a 2D array, view each of those rows as a single scalar using NumPy's ndarray view machinery, and finally use np.unique(..., return_inverse=True) to factorize -

np.unique(get_row_view(np.array(tups)), return_inverse=1)[1]

get_row_view is taken from here.

Sample run -

In [23]: tups
Out[23]: [(1, 2), ('a', 'b'), (3, 4), ('c', 5), (6, 'd'), ('a', 'b'), (3, 4)]

In [24]: np.unique(get_row_view(np.array(tups)), return_inverse=1)[1]
Out[24]: array([0, 3, 1, 4, 2, 3, 1])

Approach #2

def argsort_unique(idx):
    # Original idea : https://stackoverflow.com/a/41242285/3293881 
    n = idx.size
    sidx = np.empty(n,dtype=int)
    sidx[idx] = np.arange(n)
    return sidx

def unique_return_inverse_tuples(tups):
    a = np.array(tups)
    sidx = np.lexsort(a.T)
    b = a[sidx]
    mask0 = ~((b[1:,0] == b[:-1,0]) & (b[1:,1] == b[:-1,1]))
    ids = np.concatenate(([0], mask0  ))
    np.cumsum(ids, out=ids)
    return ids[argsort_unique(sidx)]

Sample run -

In [69]: tups
Out[69]: [(1, 2), ('a', 'b'), (3, 4), ('c', 5), (6, 'd'), ('a', 'b'), (3, 4)]

In [70]: unique_return_inverse_tuples(tups)
Out[70]: array([0, 3, 1, 2, 4, 3, 1])
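(For reference, the `argsort_unique` helper above inverts a permutation in O(n) rather than calling `argsort` a second time; a minimal standalone check:)

```python
import numpy as np

def argsort_unique(idx):
    # invert a permutation: sidx[idx[i]] = i
    n = idx.size
    sidx = np.empty(n, dtype=int)
    sidx[idx] = np.arange(n)
    return sidx

perm = np.array([2, 0, 3, 1])
inv = argsort_unique(perm)
print(inv)        # [1 3 0 2]
print(perm[inv])  # [0 1 2 3] -- composing with the permutation restores identity
```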
Divakar
  • I haven't tested anything yet. However, I believe that `np.unique` sorts when you use `return_inverse=1`. That makes this *O(n log n)*. Correct me if I'm wrong. – piRSquared May 26 '17 at 18:40
  • @piRSquared Well that array conversion itself looks like the bottleneck on this one. Doesn't look like NumPy is the best choice here given the mixed type data. – Divakar May 26 '17 at 18:49

I don't know about timings, but a simple approach would be to use numpy.unique along the respective axis.

tups = [(1, 2), ('a', 'b'), (3, 4), ('c', 5), (6, 'd'), ('a', 'b'), (3, 4)]
res = np.unique(tups, return_inverse=1, axis=0)
print(res)

which yields

(array([['1', '2'],
       ['3', '4'],
       ['6', 'd'],
       ['a', 'b'],
       ['c', '5']],
      dtype='|S11'), array([0, 3, 1, 4, 2, 3, 1], dtype=int64))

The array is automatically sorted, but that should not be a problem.
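If first-appearance numbering is wanted after all, the sorted inverse can be relabeled via `return_index` — a sketch (assumes NumPy >= 1.13 for the `axis` argument):

```python
import numpy as np

tups = [(1, 2), ('a', 'b'), (3, 4), ('c', 5), (6, 'd'), ('a', 'b'), (3, 4)]
u, first_idx, inv = np.unique(tups, return_index=True, return_inverse=True, axis=0)

# rank each unique row by where it first appeared, then relabel the inverse
rank = np.argsort(np.argsort(first_idx))
codes = rank[inv]
print(codes)  # [0 1 2 3 4 1 2]
```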

ImportanceOfBeingErnest
  • How did I miss the `axis` argument of `np.unique`!! Thank you! Still the same criticism as of @Divakar's answer: I believe this is *O(n log n)*, which won't be as quick as `pd.factorize`. I'll test it though and see. – piRSquared May 26 '17 at 18:56
  • This only works for numpy 1.13... 1.12 didn't have axis – Gerges May 26 '17 at 18:58
  • @GergesDib and that's why I missed it :-) – piRSquared May 26 '17 at 19:00

I was going to give this answer

pd.factorize([str(x) for x in tups])

However, after running some tests, it did not turn out to be the fastest of them all. Since I already did the work, I will show it here for comparison:

@AChampion

%timeit [d[tup] if tup in d else d.setdefault(tup, next(c)) for tup in tups]
1000000 loops, best of 3: 1.66 µs per loop

@Divakar

%timeit np.unique(get_row_view(np.array(tups)), return_inverse=1)[1]
# 10000 loops, best of 3: 58.1 µs per loop

@self

%timeit pd.factorize([str(x) for x in tups])
# 10000 loops, best of 3: 65.6 µs per loop

@root

%timeit pd.Series(tups).factorize()[0] 
# 1000 loops, best of 3: 199 µs per loop

EDIT

For large data with 100K entries, we have:

tups = [(np.random.randint(0, 10), np.random.randint(0, 10)) for i in range(100000)]

@root

%timeit pd.Series(tups).factorize()[0] 
100 loops, best of 3: 10.9 ms per loop

@AChampion

%timeit [d[tup] if tup in d else d.setdefault(tup, next(c)) for tup in tups]

# 10 loops, best of 3: 16.9 ms per loop

@Divakar

%timeit np.unique(get_row_view(np.array(tups)), return_inverse=1)[1]
# 10 loops, best of 3: 81 ms per loop

@self

%timeit pd.factorize([str(x) for x in tups])
10 loops, best of 3: 87.5 ms per loop
Gerges

You can use SKLearn's MultiLabelBinarizer, which will provide you with a binary encoding for each tuple:

from sklearn.preprocessing import MultiLabelBinarizer

mlb = MultiLabelBinarizer()
codes = mlb.fit_transform(np.array(tups)) # Must be passed as an array
>>> codes
array([[1, 1, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 1, 1, 0, 0],
       [0, 0, 1, 1, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 1, 0, 0, 0, 1, 0],
       [0, 0, 0, 0, 0, 1, 0, 0, 0, 1],
       [0, 0, 0, 0, 0, 0, 1, 1, 0, 0],
       [0, 0, 1, 1, 0, 0, 0, 0, 0, 0]])

These can be packed into bytes (if desired) with np.packbits(codes) — note this flattens all 70 bits into 9 bytes rather than producing one value per row:

array([192,   0, 195,   0,  34,   4,  64, 195,   0], dtype=uint8)
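Note that this gives a one-hot-style row per tuple rather than the integer labels asked for; if labels are needed, identical binary rows can be collapsed (a sketch using the `codes` matrix above; labels come out in sorted-unique order, not first appearance):

```python
import numpy as np

# the binary codes produced above, copied literally
codes = np.array([
    [1, 1, 0, 0, 0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0, 1, 1, 0, 0],
    [0, 0, 1, 1, 0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 1, 0, 0, 0, 1, 0],
    [0, 0, 0, 0, 0, 1, 0, 0, 0, 1],
    [0, 0, 0, 0, 0, 0, 1, 1, 0, 0],
    [0, 0, 1, 1, 0, 0, 0, 0, 0, 0],
])

# identical rows collapse to the same integer label
_, labels = np.unique(codes, axis=0, return_inverse=True)
print(labels)
```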
iacob