import numpy as np
data = np.array(['b','b','b','a','a','a','a','c','c','d','d','d'])

I need to replace each group of identical strings with an integer that increases in order of first appearance, like this:

data = np.array([0,0,0,1,1,1,1,2,2,3,3,3])

I'm looking for a pure NumPy solution.


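With this dataset (http://www.uploadmb.com/dw.php?id=1364341573), `np.argsort(ind)[inv]` gives the expected result for the first 100 lines but not for the first 200: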

import numpy as np
with open('test.txt', 'r') as f:
    lines = np.array([line.strip() for line in f])
lines100 = lines[0:100]
_, ind, inv = np.unique(lines100, return_index=True, return_inverse=True)
print(ind)
print(inv)
nums = np.argsort(ind)[inv]
print(nums)

[ 0 83 62 40 19]
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4
 4 4 4 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 2 2 2 2 2 2 2 2 2 2 2 2
 2 2 2 2 2 2 2 2 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1]
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 3 3 3 3 3 3 3 3 3 3 3 3
 3 3 3 3 3 3 3 3 3 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4]

lines200 = lines[0:200]
_, ind, inv = np.unique(lines200, return_index=True, return_inverse=True)
print(ind)
print(inv)
nums = np.argsort(ind)[inv]
print(nums)
[167   0  83 124 104 144 185  62  40  19]
[1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9
 9 9 9 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 7 7 7 7 7 7 7 7 7 7 7 7
 7 7 7 7 7 7 7 7 7 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 4 4 4 4 4 4 4
 4 4 4 4 4 4 4 4 4 4 4 4 4 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 5 5 5 5
 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6]
[9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6
 6 6 6 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 5 5 5 5 5 5 5 5 5 5 5 5
 5 5 5 5 5 5 5 5 5 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 2 2 2 2 2 2 2
 2 2 2 2 2 2 2 2 2 2 2 2 2 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 4 4 4 4
 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3]
siamii

2 Answers


EDIT: This doesn't always work:

>>> a,b,c = np.unique(data, return_index=True, return_inverse=True)
>>> c # almost!!!
array([1, 1, 1, 0, 0, 0, 0, 2, 2, 3, 3, 3])
>>> np.argsort(b)[c]
array([0, 0, 0, 1, 1, 1, 1, 2, 2, 3, 3, 3], dtype=int64)

But this does work:

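def replace_groups(data):
    # b: index of each unique value's first occurrence; c: inverse mapping
    _, b, c = np.unique(data, return_index=True, return_inverse=True)
    # b[c] tags every element with its group's first index; relabel those 0..k-1
    _, ret = np.unique(b[c], return_inverse=True)
    return ret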

and is faster than the dictionary replacement approach, taking roughly 25-30% less time on the larger inputs (e.g. 23.1 ms vs. 32.1 ms on the full dataset):

def replace_groups_dict(data):
    _, ind = np.unique(data, return_index=True)
    unqs = data[np.sort(ind)]
    data_id = dict(zip(unqs, np.arange(unqs.size)))
    num = np.array([data_id[datum] for datum in data])
    return num

In [7]: %timeit replace_groups_dict(lines100)
10000 loops, best of 3: 68.8 us per loop

In [8]: %timeit replace_groups_dict(lines200)
10000 loops, best of 3: 106 us per loop

In [9]: %timeit replace_groups_dict(lines)
10 loops, best of 3: 32.1 ms per loop

In [10]: %timeit replace_groups(lines100)
10000 loops, best of 3: 67.1 us per loop

In [11]: %timeit replace_groups(lines200)
10000 loops, best of 3: 78.4 us per loop

In [12]: %timeit replace_groups(lines)
10 loops, best of 3: 23.1 ms per loop
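
To see why the two-step `replace_groups` works, it may help to look at the intermediate arrays for the small example from the question. This is only an illustrative sketch; the variable names `uniq`, `first_idx`, `first_seen` and `labels` are introduced here and are not from the original answer.

```python
import numpy as np

data = np.array(['b','b','b','a','a','a','a','c','c','d','d','d'])

# Step 1: sorted unique values, the index of each value's first occurrence,
# and, for every element, its position in the sorted unique array.
uniq, first_idx, inv = np.unique(data, return_index=True, return_inverse=True)
print(uniq)        # ['a' 'b' 'c' 'd']
print(first_idx)   # [3 0 7 9]
print(inv)         # [1 1 1 0 0 0 0 2 2 3 3 3]

# Step 2: replace every element by the first index of its group.
# Groups that appear earlier in the data get smaller numbers.
first_seen = first_idx[inv]
print(first_seen)  # [0 0 0 3 3 3 3 7 7 9 9 9]

# Step 3: a second np.unique sorts these integers, so its inverse
# relabels the groups 0, 1, 2, ... in order of first appearance.
_, labels = np.unique(first_seen, return_inverse=True)
print(labels)      # [0 0 0 1 1 1 1 2 2 3 3 3]
```

The second call only needs `return_inverse`, because the integers in `first_seen` already sort into order of first appearance.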
Jaime
  • Doesn't work for `data = np.array([3,1,2])` (see @DSM's comment on my answer) – askewchan Mar 29 '13 at 20:31
  • @askewchan Weird, I still don't fully understand why the idea we both had sometimes works but sometimes doesn't... It seems, though, that a second call to `np.unique` is a little faster than your replacement dictionary; see my edit. – Jaime Mar 29 '13 at 21:16
  • I like both solutions. I'm accepting this one because it is pure numpy and a bit faster. However, I don't fully understand it. Can you write a short explanation? – siamii Mar 29 '13 at 21:37
  • @siamii Basically our original coincident ideas sorted assuming that the `np.unique` return value didn't rearrange (like your earlier question). So you have to do it in two steps: First, to get the uniques unrearranged, then to replace them as we were trying to before. – askewchan Mar 29 '13 at 21:41
  • @Jaime I think it fails because each call to `np.unique` doesn't preserve the original order of the array, as in: http://stackoverflow.com/q/15637336/1730674 and http://stackoverflow.com/q/15649097/1730674 – askewchan Mar 29 '13 at 21:42
  • @askewchan so in this case it didn't rearrange the second time np.unique is called? What if sometimes the second np.unique also rearranges. Is this guaranteed to work? – siamii Mar 29 '13 at 22:02
  • I would expect it to be reliable, but to be honest, I am losing all trust in `np.unique` :P Maybe @Jaime has a better idea. – askewchan Mar 29 '13 at 22:07
  • @askewchan Been looking over your detective work on that other answer... Glad I missed that one! :-) The second call to `np.unique` is not asking for `return_index`, so the bug shouldn't affect it, but I must admit I am still not really sure I fully understand what is going on here... – Jaime Mar 29 '13 at 23:04

Since @DSM noticed that my original idea isn't robust, the best solution I can think of is a replacement dictionary:

data = np.array(['b','b','b','a','a','a','a','c','c','d','d','d'])
_, ind = np.unique(data, return_index=True)
unqs = data[np.sort(ind)]
data_id = dict(zip(unqs, np.arange(unqs.size)))
num = np.array([data_id[datum] for datum in data])

For the month data:

In [5]: f = open('test.txt','r')

In [6]: data = np.array([line.strip() for line in f.readlines()])

In [7]: _, ind = np.unique(data, return_index=True)

In [8]: months = data[np.sort(ind)]

In [9]: month_id = dict(zip(months, np.arange(months.size)))

In [10]: np.array([month_id[datum] for datum in data])
Out[10]: array([ 0,  0,  0, ..., 41, 41, 41])
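
A possible explanation for why the original `np.argsort(ind)[inv]` idea fails on inputs such as `[3, 1, 2]`: `np.argsort(ind)` returns the permutation that sorts `ind`, not the rank of each entry. The two coincide only when that permutation is its own inverse, which happens to be the case in the `'b','a','c','d'` example. The sketch below inverts the permutation to obtain ranks; `replace_groups_rank` is a hypothetical name introduced here, not something from the original answers.

```python
import numpy as np

def replace_groups_rank(data):
    """Label values 0, 1, 2, ... in order of first appearance (illustrative helper)."""
    data = np.asarray(data)
    _, ind, inv = np.unique(data, return_index=True, return_inverse=True)
    order = np.argsort(ind)               # order[k] = unique value that appears k-th
    rank = np.empty_like(order)
    rank[order] = np.arange(order.size)   # invert the permutation: rank of each unique value
    return rank[inv]

_, ind, inv = np.unique([3, 1, 2], return_index=True, return_inverse=True)
print(np.argsort(ind)[inv])            # [1 2 0]  (permutation used as if it were a rank)
print(replace_groups_rank([3, 1, 2]))  # [0 1 2]
```

`np.argsort(np.argsort(ind))` would give the same ranks; the explicit inversion above just avoids a second sort.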
askewchan
  • if you still have my test data from last time, can you check it with that? I believe there's another bug in np.argsort. – siamii Mar 29 '13 at 19:44
  • All the months? Did you ever upgrade numpy? – askewchan Mar 29 '13 at 19:46
  • I fixed the bug in the source, because I can't upgrade now, so the months are correctly returned. This seems to be independent of that – siamii Mar 29 '13 at 19:50
  • No, I don't get the correct results either... looking into it, but my edit has a workaround that does work, but might be slow. – askewchan Mar 29 '13 at 19:57
  • If this approach works, then shouldn't `a,b,c = np.unique([3,1,2], True, True); print np.argsort(b)[c]` give `[0, 1, 2]`? Doesn't seem to, though. – DSM Mar 29 '13 at 20:14