Hi, I am trying to plot a numpy array of strings on the y axis, for example

arr = np.array(['a','a','bas','dgg','a']) #The actual strings are about 11 characters long

vs a float array with equal length. The string array I am working with is very large ~ 100 million entries. One of the solutions I had in mind was to convert the string array to unique integer ids, for example,

vocab = np.unique(arr)
vocab = list(vocab)
arrId = np.zeros(len(arr))
for i in range(len(arr)):
    arrId[i] = vocab.index(arr[i])

and then matplotlib.pyplot.plot(arrId). But I cannot afford to run a for loop to convert the array of strings to an array of unique integer ids. In an initial search I could not find a way to map strings to a unique id without using a loop. Maybe I am missing something, but is there a smart way to do this in Python?

EDIT -

Thanks. The solutions provided use vocab, ind = np.unique(arr, return_inverse=True), where ind is the returned array of integer ids. But it seems that np.unique is O(N*log(N)) according to this (numpy.unique with order preserved), whereas pandas.unique is O(N). I am not sure how to get ind from pandas.unique, though. Plotting the data can, I guess, be done in O(N). So I was wondering: is there a way to do the whole thing in O(N), perhaps by hashing of some sort?

    Would the [`collections.Counter`](https://docs.python.org/3/library/collections.html#collections.Counter) be anything that would interest you? I think it is somewhat faster than your method, but I'm not sure. – Kendas Jun 02 '17 at 12:51

3 Answers

numpy.unique used with the return_inverse argument allows you to obtain the inverted index.

arr = np.array(['a','a','bas','dgg','a'])
unique, rev = np.unique(arr, return_inverse=True)

#unique: ['a' 'bas' 'dgg']
#rev: [0 0 1 2 0]

such that unique[rev] returns the original array ['a' 'a' 'bas' 'dgg' 'a'].

This can be easily used to plot the data.

import numpy as np
import matplotlib.pyplot as plt

arr = np.array(['a','a','bas','dgg','a'])
x = np.array([1,2,3,4,5])

unique, rev = np.unique(arr, return_inverse=True)
print(unique)
print(rev)
print(unique[rev])

fig,ax=plt.subplots()
ax.scatter(x, rev)
ax.set_yticks(range(len(unique)))
ax.set_yticklabels(unique)

plt.show()


ImportanceOfBeingErnest
  • Thanks a lot, an elegant solution! I didn't know about return_inverse. Feel stupid. – Siddhartha Satpathi Jun 02 '17 at 14:15
  • You don't need to feel stupid for not knowing something. Acquiring knowledge is often about asking the right questions (whether it is asking Google or, if that doesn't help, asking here). – ImportanceOfBeingErnest Jun 02 '17 at 14:20
  • My only comment would be that if the difference between O(N) and O(Nlog(N)) matters for you, you're on the wrong track using a matplotlib scatterplot. I somehow doubt that plotting that many points such that speed would matter will then produce a meaningful plot. – ImportanceOfBeingErnest Jun 03 '17 at 21:22

You can factorize your strings:

In [75]: arr = np.array(['a','a','bas','dgg','a'])

In [76]: cats, idx = np.unique(arr, return_inverse=True)

In [77]: plt.plot(idx)
Out[77]: [<matplotlib.lines.Line2D at 0xf82da58>]

In [78]: cats
Out[78]:
array(['a', 'bas', 'dgg'],
      dtype='<U3')

In [79]: idx
Out[79]: array([0, 0, 1, 2, 0], dtype=int64)
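
On the O(N) question raised in the edit: pandas has a function called factorize that is hash-based rather than sort-based, and it returns the integer codes directly. The following is only a minimal sketch on the toy array from the question; it assumes pandas is installed, and whether it is actually faster than np.unique on ~100 million strings is something to time on the real data rather than take for granted.

```python
import numpy as np
import pandas as pd

arr = np.array(['a', 'a', 'bas', 'dgg', 'a'])

# pd.factorize returns (codes, uniques). Codes are assigned in order of
# first appearance unless sort=True is passed, so the numbering can differ
# from the sorted numbering that np.unique produces.
codes, uniques = pd.factorize(arr)

print(codes)           # one integer id per element of arr
print(uniques)         # the distinct strings
print(uniques[codes])  # indexing with the codes rebuilds the original strings
```

The codes can be passed to plt.plot or ax.scatter in the same way as idx above, with uniques used for the tick labels.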
MaxU - stand with Ukraine

You can use the numpy unique function to return an array of the unique values:

print(np.unique(arr))

['a' 'bas' 'dgg']

collections.Counter also returns each value together with its number of occurrences:

print(collections.Counter(arr))
Counter({'a': 3, 'bas': 1, 'dgg': 1})

Does this help at all?

samocooper
  • Sorry, then define the unique array as a dictionary to give a unique integer value for each string. – samocooper Jun 02 '17 at 12:56
  • Yes, basically I wanted to know a way to define the unique integer array without using a for loop. I am not sure how to do it after getting the dictionary. Could you please elaborate on what you had in mind? – Siddhartha Satpathi Jun 02 '17 at 14:18
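
One possible reading of the dictionary suggestion above, sketched under assumptions: the names lookup and arr_id are made up for illustration, and this is not necessarily what the commenter had in mind. Each element costs one constant-time dictionary lookup instead of a linear vocab.index search, but the generator is still a Python-level loop, so on a very large array it is likely to be slower than np.unique with return_inverse=True.

```python
import numpy as np

arr = np.array(['a', 'a', 'bas', 'dgg', 'a'])

# Map each distinct string to an integer id (lookup is a hypothetical name).
vocab = np.unique(arr)
lookup = {s: i for i, s in enumerate(vocab.tolist())}

# One pass over the data with one dictionary lookup per element.
# np.fromiter fills the integer array directly from the generator.
arr_id = np.fromiter((lookup[s] for s in arr), dtype=np.intp, count=len(arr))

print(arr_id)
print(vocab[arr_id])  # indexing with the ids rebuilds the original strings
```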