
I would like to load a huge matrix from a parquet file and distribute the distance computation across several nodes, in order to both save memory and speed up the computation.

The input data has 42,000 rows (features) and 300,000 columns (samples):

X         sample1  sample2  sample3
feature1  0        1        1
feature2  1        0        1
feature3  0        0        1

(the header row and the row labels are shown here only to describe the input data)

I also have a list of samples [sample1, sample2, sample3, …], which could help (e.g. with itertools.combinations or similar).

I would like to apply a commutative function over each pair of samples. With pandas, I do this:

similarity = df[df[sample1] == df[sample2]][sample1].sum()
dissimilarity = df[df[sample1] != df[sample2]][sample1].sum()
score = similarity - dissimilarity

So, is it possible to use both ray and numpy broadcasting to speed up this computation?

@Jaime's answer is really close to what I need.

Maybe I could process the samples in n batches, using something like:

batch1 = [sample1, sample2, …]
data = pandas.read_parquet(somewhere, columns=batch1).to_numpy()
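
For illustration only, a minimal sketch of that batching idea (the parquet path, batch size and sample list are placeholders, and pairs of samples that fall in two different batches would need both batches loaded at the same time):

import pandas as pd

samples = ["sample1", "sample2", "sample3"]   # placeholder: full list of sample names
batch_size = 1000                             # placeholder: tune to the available RAM

def batches(names, size):
    """Yield successive chunks of sample names."""
    for start in range(0, len(names), size):
        yield names[start:start + size]

for batch in batches(samples, batch_size):
    # read only the columns belonging to this batch of samples
    data = pd.read_parquet("somewhere.parquet", columns=batch).to_numpy()
    # ... compute pairwise scores for the samples within this batch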

Thanks for your help

Note 1: Input data of 10 samples can be emulated like this:

import random
import numpy as np

# 30,000 features (rows) x 10 samples (columns) of random 0/1 values
foo = np.array([[random.randint(0, 1) for _ in range(0, 10)] for _ in range(0, 30000)])
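
As a quick sanity check on the emulated data (assuming the intended score is the number of matching features minus the number of differing features, which is how the answer below interprets it):

# score for the pair (sample 0, sample 1): matches minus mismatches over all features
eq = np.sum(foo[:, 0] == foo[:, 1])
score = int(eq - (foo.shape[0] - eq))
print(score)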

Note 2: I tried scipy's spatial distance on one node but ran out of memory, which is why I would like to split the computation across several nodes.
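
Presumably the attempt looked something like the sketch below; note that even the condensed output of `pdist` holds one value per pair of samples, i.e. roughly 300,000 × 299,999 / 2 ≈ 4.5 × 10¹⁰ float64 values (about 360 GB), which alone explains the memory error:

from scipy.spatial import distance

# pdist expects one sample per row, hence the transpose of the emulated matrix
d = distance.pdist(foo.T, metric="hamming")   # fraction of differing features per pair
scores = (1 - 2 * d) * foo.shape[0]           # equals matches minus mismatches per pair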

bioinfornatics

1 Answer


Just pitching some ideas here outlining the difficulties / best (?) way to calculate similarity:

import itertools as it
import numpy as np

n_samples, n_features = 42_000, 300_000

# usually the other way around
data = np.random.randint(0, 2, size=(n_samples, n_features), dtype=np.uint8)
# 42,000 * 300,000 = 12,600,000,000; already 12.6GB RAM just to load the entire data

# the similarity score can be at most n_features
# (each sample has perfect similarity to itself)
# storing every pairwise similarity in a matrix for the question's 300,000 samples
# needs at least 300,000² = 90,000,000,000 entries * 4 bytes (np.int32) = 360 GB of RAM;
# np.int16 (-32768, 32767) would not be large enough to hold the score
sim_mat = np.eye(n_samples, dtype=np.int32) * n_features

# fastest way of computing similarity I could come up with
# sim = (np.sum(data[i] == data[j]) - n_features/2) * 2
# same as np.sum(data[i] == data[j]) - np.sum(data[i] != data[j])

baseline = n_features/2
for i, j in it.combinations(range(n_samples), 2):
    sim_mat[i, j] = sim_mat[j, i] = (np.sum(data[i] == data[j]) - baseline) * 2

Some helper functions that might be useful:

def similarity_from_to(data: np.ndarray, from_i: int, to_i: int) -> int:
    """
    Computes similarities from sample `data[from_i]` to sample `data[to_i]`

    Parameters
    ----------
    data : np.ndarray
        2D data matrix of shape (N_samples, N_features)
    from_i : int
        index of first sample in [0, N_samples)
    to_i : int
        index of second sample in [0, N_samples)

    Returns
    -------
    similarity : int
        similarity-score in [-N_features, N_features]
    """
    return int((np.sum(data[from_i] == data[to_i]) - data.shape[1]/2) * 2)

def similarities_from(data: np.ndarray, from_i: int):
    """
    Computes similarities from sample `from_i` to all other samples

    Parameters
    ----------
    data : np.ndarray
        2D data matrix of shape (N_samples, N_features)
    from_i : int
        index of target sample in [0, N_samples)

    Returns
    -------
    similarities : np.ndarray
        similarity-scores for all samples to data[`from_i`]; in shape (N_samples, )
    """
    baseline = data.shape[1] / 2
    return np.asarray([(np.sum(data[from_i] == data[to_i]) - baseline) * 2 for to_i in range(len(data))], dtype=np.int32)
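
For the distribution part of the question, one possible (untested) sketch of fanning `similarities_from` out over several nodes with ray; the chunk size and cluster setup are placeholders, and the full result would still need the ~360 GB estimated above, so in practice each task would rather write its block of rows to disk than return it:

import ray
import numpy as np

ray.init()                      # or ray.init(address="auto") to join an existing cluster

data_ref = ray.put(data)        # place the matrix in the shared object store once

@ray.remote
def rows_of_similarity(data, row_indices):
    # one row of the similarity matrix per requested sample index
    return np.stack([similarities_from(data, i) for i in row_indices])

chunk = 100                     # placeholder: how many rows each task handles
futures = [
    rows_of_similarity.remote(data_ref, range(start, min(start + chunk, n_samples)))
    for start in range(0, n_samples, chunk)
]

# gathering everything back only works for small n_samples;
# on the real data each task should persist its rows instead of returning them
sim_rows = np.vstack(ray.get(futures))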
Stefan B
  • Thanks @stefan-b for your well described snippet. 1) I think using `it.combinations(range(n_samples), 2)` avoids the need for an `if i != j` check; maybe that is done internally, in which case there is no real benefit. 2) Is it possible to avoid the Python for loop and use a broadcast operation instead? – bioinfornatics Feb 15 '21 at 17:37
  • I can't think of a way at least – Stefan B Feb 15 '21 at 17:38