Hi I am trying to plot a numpy array of strings in y axis, for example
arr = np.array(['a','a','bas','dgg','a']) #The actual strings are about 11 characters long
vs a float array with equal length. The string array I am working with is very large ~ 100 million entries. One of the solutions I had in mind was to convert the string array to unique integer ids, for example,
vocab = np.unique(arr)
vocab = list(vocab)
arrId = np.zeros(len(arr))
for i in range(len(arr)):
arrId[i] = vocab.index(arr[i])
and then matplotlib.pyplot.plot(arrId)
. But I cannot afford to run a for loop to convert the array of strings to an array of unique integer ids. In an initial search I could not find a way to map strings to an unique id without using a loop. Maybe I am missing something, but is there a smart way to do this in python?
EDIT -
Thanks. The solutions provided use vocab,ind = np.unique(arr, return_index = True)
where idx
is the returned unique integer array. But it seems like np.unique is O(N*log(N)) according to this ( numpy.unique with order preserved), but pandas.unique is of order O(N). But I am not sure how to get ind
from pandas.unique. plotting data i guess can be done in O(N). So I was wondering is there a way to do this O(N)? perhaps by hashing of some sort?