0

I have a 2-D NumPy array of sample pairs and a 1-D NumPy array of samples. I want to convert the sample pairs into an array of the same shape that holds each sample's index in the samples array. Is there a faster solution than the one I'm already using?

import numpy as np
pair_list = np.array([['samp1', 'samp4'],
                ['samp2', 'samp7'],
                ['samp2', 'samp4']])
samples = np.array(['samp0', 'samp1', 'samp2', 'samp3', 'samp4', 'samp5',
                 'samp6', 'samp7', 'samp8', 'samp9'])

vfunc = np.vectorize(lambda s: np.where(samples == s)[0])
pair_indices = vfunc(pair_list)

In [180]: print(pair_indices)
[[1 4]
 [2 7]
 [2 4]]
user3329732

2 Answers

2

I suggest using a dictionary: a dict lookup takes constant time on average, whereas np.where has to scan the whole samples array for every element of pair_list.

>>> import numpy as np
>>> pair_list = np.array([['samp1', 'samp4'],
...                       ['samp2', 'samp7'],
...                       ['samp2', 'samp4']])
>>> samples = {'samp0': 0, 'samp1': 1, 'samp2': 2, 'samp3': 3, 'samp4': 4,
...            'samp5': 5, 'samp6': 6, 'samp7': 7, 'samp8': 8, 'samp9': 9}
>>> vfunc = np.vectorize(lambda x: samples[x])
>>> pair_indices = vfunc(pair_list)
>>> print(pair_indices)
[[1 4]
 [2 7]
 [2 4]]
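
If samples starts out as a NumPy array, as in the question, the dictionary does not have to be typed by hand. A minimal sketch, assuming the sample names are unique (the names lookup and to_index are illustrative and not from the question):

```python
import numpy as np

samples = np.array(['samp0', 'samp1', 'samp2', 'samp3', 'samp4',
                    'samp5', 'samp6', 'samp7', 'samp8', 'samp9'])
pair_list = np.array([['samp1', 'samp4'],
                      ['samp2', 'samp7'],
                      ['samp2', 'samp4']])

# Map each sample name to its position in samples.
# Assumption: names are unique; a repeated name would keep its last index.
lookup = {name: idx for idx, name in enumerate(samples)}

# Passing otypes states the output dtype up front, so np.vectorize
# does not need a trial call to infer it.
to_index = np.vectorize(lambda name: lookup[name], otypes=[int])

pair_indices = to_index(pair_list)
print(pair_indices)
```

A name that is missing from samples raises KeyError here, which may or may not be the behaviour you want.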
Mohsen_Fatemi
1
import numpy as np

pair_list = np.array([['samp1', 'samp4'],
                      ['samp2', 'samp7'],
                      ['samp2', 'samp4']])
samples = np.array(['samp0', 'samp1', 'samp2', 'samp3', 'samp4', 'samp5',
                    'samp6', 'samp7', 'samp8', 'samp9'])

def f1(pair_list, samples):
    # Original approach: scan samples with np.where for every element.
    vfunc = np.vectorize(lambda s: np.where(samples == s)[0])
    return vfunc(pair_list)

def f2(pair_list, samples):
    # Build a name -> index dict once, then look each element up in it.
    d = dict()
    for idx, el in enumerate(samples):
        d[el] = idx
    return np.array([d[el] for row in pair_list for el in row]).reshape(pair_list.shape)

f2 looks clumsy, but...

timeit f1(pair_list,samples)
25.7 µs ± 78 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

timeit f2(pair_list,samples)
9.09 µs ± 68.5 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

Try it on your machine and see how it goes for you! Of course, it'll be even better if you have the ability to reuse samples, since in that case you only have to convert samples to a dict once.
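
To illustrate the reuse point, here is a rough sketch in which the dict is built once and then shared across several pair arrays. The names sample_index, pairs_to_indices, batch_a and batch_b are made up for this example, and the sketch assumes every name in a batch occurs in samples:

```python
import numpy as np

samples = np.array(['samp0', 'samp1', 'samp2', 'samp3', 'samp4',
                    'samp5', 'samp6', 'samp7', 'samp8', 'samp9'])

# Built once, up front. Assumption: sample names are unique.
sample_index = {name: idx for idx, name in enumerate(samples)}

def pairs_to_indices(pairs, index):
    """Translate an array of sample names into a same-shaped array of indices."""
    flat = [index[name] for name in pairs.ravel()]
    return np.array(flat).reshape(pairs.shape)

# Hypothetical batches that all refer to the same samples array.
batch_a = np.array([['samp1', 'samp4'], ['samp2', 'samp7']])
batch_b = np.array([['samp2', 'samp4'], ['samp9', 'samp0']])

# The dict is not rebuilt between calls.
indices_a = pairs_to_indices(batch_a, sample_index)
indices_b = pairs_to_indices(batch_b, sample_index)
```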

Edit: It's much, much better to vectorize dict access, as suggested by Mohsen_Fatemi, even if samples can't be reused.

def f3(pair_list, samples):
    # Same dict as in f2, but the lookup is applied through np.vectorize.
    d = dict()
    for idx, el in enumerate(samples):
        d[el] = idx
    vfunc = np.vectorize(lambda x: d[x])
    return vfunc(pair_list)

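timeit f3
16.1 ns ± 0.0138 ns per loop (mean ± std. dev. of 7 runs, 100000000 loops each)

Caveat: the line above times the bare name f3 rather than the call f3(pair_list,samples), so the 16.1 ns figure measures a name lookup and is not comparable to the f1 and f2 timings. Time f3(pair_list,samples) for a like-for-like comparison.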
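
Since timeit only measures a function's work when it is given an actual call, a like-for-like comparison of the three functions could be sketched with the standard-library timeit module as below. This is only a sketch: the repeat count of 2000 is arbitrary, the candidates and timings dicts are names invented here, f1 uses [0][0] instead of [0] so the lambda returns a scalar, and the numbers will differ from machine to machine.

```python
import timeit
import numpy as np

pair_list = np.array([['samp1', 'samp4'],
                      ['samp2', 'samp7'],
                      ['samp2', 'samp4']])
samples = np.array(['samp0', 'samp1', 'samp2', 'samp3', 'samp4',
                    'samp5', 'samp6', 'samp7', 'samp8', 'samp9'])

def f1(pair_list, samples):
    # Linear scan of samples for every element; [0][0] extracts a scalar index.
    vfunc = np.vectorize(lambda s: np.where(samples == s)[0][0])
    return vfunc(pair_list)

def f2(pair_list, samples):
    # Dict lookup inside a list comprehension.
    d = {el: idx for idx, el in enumerate(samples)}
    return np.array([d[el] for row in pair_list for el in row]).reshape(pair_list.shape)

def f3(pair_list, samples):
    # Dict lookup applied through np.vectorize.
    d = {el: idx for idx, el in enumerate(samples)}
    return np.vectorize(lambda x: d[x])(pair_list)

candidates = {'f1': f1, 'f2': f2, 'f3': f3}
n_calls = 2000  # arbitrary repeat count chosen for this sketch
timings = {}
for name, func in candidates.items():
    # The lambda performs a real call, so the function body is what gets timed.
    total = timeit.timeit(lambda: func(pair_list, samples), number=n_calls)
    timings[name] = total / n_calls
    print(f"{name}: {timings[name] * 1e6:.2f} microseconds per call")
```

With an input this small, fixed overheads dominate, so it is worth repeating the measurement with arrays of realistic size before picking a version.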
Mark Snyder