6

I'm trying to apply a function to each pair of columns in a numpy array (each column is an individual's genotype).

For example:

[48]: g[0:10,0:10]

array([[ 1,  1,  1,  1,  1,  1,  1,  1,  1, -1],
      [ 1,  1,  1,  1,  1,  1,  1,  1,  1,  1],
      [ 1,  1,  1,  1,  1,  1, -1,  1,  1,  1],
      [ 1,  1,  1,  1,  1,  1,  1,  1,  1,  1],
      [ 0,  0,  0,  0,  0,  0,  0,  0,  0,  0],
      [ 1,  1,  1,  1,  1,  1,  1,  1,  1, -1],
      [-1, -1,  0, -1, -1, -1, -1, -1, -1,  0],
      [ 1,  1,  1,  1,  1,  1,  1,  1,  1,  1],
      [ 1,  1,  1,  1,  1,  1,  1,  1,  1,  1],
      [ 1,  1,  1,  1,  1,  1,  1,  1,  1,  1]], dtype=int8)

My goal is to produce a distance matrix d so that each element of d is the pairwise distance comparing each column in g.

d[0,1] = func(g[:,0], g[:,1])

Any ideas would be fantastic! Thank you!

Héctor M.
  • 2,302
  • 4
  • 17
  • 35
ksw
  • 313
  • 3
  • 11
  • 3
    How does the distance function work? – Mazdak Apr 13 '18 at 16:35
  • It takes 2 numpy arrays to compare them: def count_snp_diffs(x, y): return np.count_nonzero((x != y) & (x >= 0) & (y >= 0)) . Courtesy of http://alimanfoo.github.io/ – ksw Apr 13 '18 at 16:37

3 Answers3

2

You can simply define the function as:

def count_snp_diffs(x, y): 
    return np.count_nonzero((x != y) & (x >= 0) & (y >= 0),axis=0)

And then call it using as index an array generated with itertools.combinations, in order to get all possible column combinations:

combinations = np.array(list(itertools.combinations(range(g.shape[1]),2)))
dist = count_snp_diffs(g[:,combinations[:,0]], g[:,combinations[:,1]])

In addition, if the output must be stored in a matrix (which for large g is not recomendes because only the upper triangle will be filled and all the rest will be useless info, this can be achieved with the same trick:

d = np.zeros((g.shape[1],g.shape[1]))
combinations = np.array(list(itertools.combinations(range(g.shape[1]),2)))
d[combinations[:,0],combinations[:,1]] = count_snp_diffs(g[:,combinations[:,0]], g[:,combinations[:,1]])

Now, d[i,j] returns the distance between columns i and j (whereas d[j,i] is a zero). This approach relies in the fact that arrays can be indexed with lists or arrays containing repeated indexes:

a = np.arange(3)+4
a[[0,1,1,1,0,2,1,1]]
# Out
# [4, 5, 5, 5, 4, 6, 5, 5]

Here is one step by step explanation of what is happening.

Calling g[:,combinations[:,0]] accesses all the columns in the first clumn of permutations, generating a new array, which is compared column by column with the array generated with g[:,combinations[:,1]]. Thus, A boolean array diff is generated. If g had 3 columns it would look like this, where each column is the comparison of columns 0,1, 0,2 and 1,2:

[[ True False False]
 [False  True False]
 [ True  True False]
 [False False False]
 [False  True False]
 [False False False]]

And finally, the values for each column are added:

np.count_nonzero(diff,axis=0)
# Out
# [2 3 0]

In addition, due to the fact that the boolean class in python inherits from integer class (roughly False==0 and and True==1, see this answer of "Is False == 0 and True == 1 in Python an implementation detail or is it guaranteed by the language?" for more info). The np.count_nonzero adds 1 for each True position, which is the same result obtained with np.sum:

np.sum(diff,axis=0) 
# Out
# [2 3 0]

Notes on performance and memory

For large arrays, working with the whole array at a time can require too much memory and you can get a Memory Error, however, for small or medium arrays, it tends to be the fastest approach. In some cases it can be useful to work by chunks:

combinations = np.array(list(itertools.combinations(range(g.shape[1]),2)))
n = len(combinations)
dist = np.empty(n)
# B = np.zeros((g.shape[1],g.shape[1]))
chunk = 200
for i in xrange(chunk,n,chunk):
    dist[i-chunk:i] = count_snp_diffs(g[:,combinations[i-chunk:i,0]], g[:,combinations[i-chunk:i,1]])
    # B[combinations[i-chunk:i,0],combinations[i-chunk:i,1]] = count_snp_diffs(g[:,combinations[i-chunk:i,0]], g[:,combinations[i-chunk:i,1]])
dist[i:] = count_snp_diffs(g[:,combinations[i:,0]], g[:,combinations[i:,1]])
# B[combinations[i:,0],combinations[i:,1]] = count_snp_diffs(g[:,combinations[i:,0]], g[:,combinations[i:,1]])

For g.shape=(300,N), the execution times reported by %%timeit in my computer with python 2.7, numpy 1.14.2 and allel 1.1.10 are:

  • 10 columns
    • numpy + matrix storage: 107 µs
    • numpy + 1D storage : 101 µs
    • allel : 247 µs
  • 100 columns
    • numpy + matrix storage: 15.7 ms
    • numpy + 1D storage : 16 ms
    • allel : 22.6 ms
  • 1000 columns
    • numpy + matrix storage: 1.54 s
    • numpy + 1D storage : 1.53 s
    • allel : 2.28 s

With these array dimensions, pure numpy is a litle faster than allel module, but the computation time should be checked for the dimensions in your problem.

OriolAbril
  • 7,315
  • 4
  • 29
  • 40
0

You can create your expected pairs using np.dstack and then apply the function on third axis using np.apply_along_axis.

new = np.dstack((arr[:,:-1], arr[:, 1:]))
np.apply_along_axis(np.sum, 2, new)

Example :

In [86]: arr = np.array([[ 1,  1,  1,  1,  1,  1,  1,  1,  1, -1],
    ...:        [ 1,  1,  1,  1,  1,  1,  1,  1,  1,  1],
    ...:        [ 1,  1,  1,  1,  1,  1, -1,  1,  1,  1],
    ...:        [ 1,  1,  1,  1,  1,  1,  1,  1,  1,  1],
    ...:        [ 0,  0,  0,  0,  0,  0,  0,  0,  0,  0],
    ...:        [ 1,  1,  1,  1,  1,  1,  1,  1,  1, -1],
    ...:        [-1, -1,  0, -1, -1, -1, -1, -1, -1,  0],
    ...:        [ 1,  1,  1,  1,  1,  1,  1,  1,  1,  1],
    ...:        [ 1,  1,  1,  1,  1,  1,  1,  1,  1,  1],
    ...:        [ 1,  1,  1,  1,  1,  1,  1,  1,  1,  1]], dtype=np.int8)
    ...:        
    ...:        

In [87]: new = np.dstack((arr[:,:-1], arr[:, 1:]))

In [88]: new
Out[88]: 
array([[[ 1,  1],
        [ 1,  1],
        [ 1,  1],
        [ 1,  1],
        [ 1,  1],
        [ 1,  1],
        [ 1,  1],
        [ 1,  1],
        [ 1, -1]],

    ...

In [89]: 

In [89]: np.apply_along_axis(np.sum, 2, new)
Out[89]: 
array([[ 2,  2,  2,  2,  2,  2,  2,  2,  0],
       [ 2,  2,  2,  2,  2,  2,  2,  2,  2],
       [ 2,  2,  2,  2,  2,  0,  0,  2,  2],
       [ 2,  2,  2,  2,  2,  2,  2,  2,  2],
       [ 0,  0,  0,  0,  0,  0,  0,  0,  0],
       [ 2,  2,  2,  2,  2,  2,  2,  2,  0],
       [-2, -1, -1, -2, -2, -2, -2, -2, -1],
       [ 2,  2,  2,  2,  2,  2,  2,  2,  2],
       [ 2,  2,  2,  2,  2,  2,  2,  2,  2],
       [ 2,  2,  2,  2,  2,  2,  2,  2,  2]])
Mazdak
  • 105,000
  • 18
  • 159
  • 188
0

Thank you for suggestions! I've just been told that this is possible with scikit-allel: https://scikit-allel.readthedocs.io/en/latest/ where you can define your own distance matrix to be performed on pairwise combinations of columns in a 2-D numpy array:

dist = allel.pairwise_distance(g, metric=count_snp_diffs)

Thank you for your help!

http://alimanfoo.github.io/2016/06/10/scikit-allel-tour.html

ksw
  • 313
  • 3
  • 11
  • Oh! Now I understand properly the desired output. I will edit my answer. A priori, I expect `numpy` to be a little faster, I don't know if this is an important factor for you – OriolAbril Apr 13 '18 at 18:29
  • I have edited the answer. There is some analysis of the computation time spent by each of them, however, you should check that with your `g` array, if it is really large, you may get `Memory Error` or end up being slower using numpy – OriolAbril Apr 13 '18 at 19:38