I would like to load a huge matrix from a parquet file and distribute the distance computation across several nodes, in order to both save memory and speed up the computation.
The input data has 42,000 rows (features) and 300,000 columns (samples):
| X | sample1 | sample2 | sample3 |
|---|---|---|---|
| feature1 | 0 | 1 | 1 |
| feature2 | 1 | 0 | 1 |
| feature3 | 0 | 0 | 1 |
(The header row and column are only shown here to describe the input data.)
I also have a list of samples, [sample1, sample2, sample3, …], which could help generate the pairs (e.g. with itertools.combinations or similar).
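For illustration, the pairs could be generated like this (the sample names are placeholders):

```
from itertools import combinations

samples = ["sample1", "sample2", "sample3"]  # placeholder names
# every unordered pair of samples; since the score function is commutative,
# each pair only needs to be computed once
pairs = list(combinations(samples, 2))
# [('sample1', 'sample2'), ('sample1', 'sample3'), ('sample2', 'sample3')]
```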
I would like to apply a commutative function to each pair of samples. With pandas, I do this:
```
similarity = df[df[sample1] == df[sample2]][sample1].sum()
dissimilarity = df[df[sample1] != df[sample2]][sample1].sum()
score = similarity - dissimilarity
```
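For reference, here is a minimal sketch of how that per-pair score could be vectorized with plain NumPy for a small in-memory batch, assuming the values are 0/1 as in the table above and that samples are columns; `X`, `G`, `col_sums` and `scores` are illustrative names:

```
import numpy as np

# X: 0/1 matrix with features as rows and samples as columns, as in the table above
# (a small random stand-in here)
rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(30000, 10))

# For 0/1 data the pandas expressions reduce to:
#   similarity(i, j)    = X[:, i] @ X[:, j]
#   dissimilarity(i, j) = X[:, i].sum() - X[:, i] @ X[:, j]
#   score(i, j)         = 2 * (X[:, i] @ X[:, j]) - X[:, i].sum()
# so all pairs can be computed with one matrix product and one broadcasted subtraction.
G = X.T @ X                          # (n_samples, n_samples) counts of shared 1s
col_sums = X.sum(axis=0)             # per-sample totals
scores = 2 * G - col_sums[:, None]   # scores[i, j] = score(sample_i as sample1, sample_j as sample2)
```

Of course the full 300,000 × 300,000 score matrix would not fit in memory, so this would only be applied to batches of columns at a time.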
So, is it possible to use both Ray and NumPy broadcasting to speed up this computation?
@Jaime's answer is really close to what I need.
Maybe I could process the samples in n batches, e.g.:

```
batch1 = [sample1, sample2, …]
data = pandas.read_parquet(somewhere, columns=batch1).to_numpy()
```
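If that batching idea works, a rough Ray sketch (untested, only an assumption of how the pieces could fit together; `PARQUET_PATH`, `score_block`, the batch size and the sample names are all placeholders) might look like this:

```
import itertools
import pandas as pd
import ray

ray.init()  # or ray.init(address="auto") to join an existing cluster

PARQUET_PATH = "somewhere.parquet"  # placeholder: every worker needs access to this file

@ray.remote
def score_block(cols_a, cols_b):
    """Read two batches of sample columns and return the block of scores
    block[i, j] = score(cols_a[i], cols_b[j]) using the 0/1 formula above."""
    A = pd.read_parquet(PARQUET_PATH, columns=cols_a).to_numpy()
    B = pd.read_parquet(PARQUET_PATH, columns=cols_b).to_numpy()
    return 2 * (A.T @ B) - A.sum(axis=0)[:, None]

# split the sample list into batches small enough to fit in a worker's memory
samples = [f"sample{i}" for i in range(1, 11)]  # placeholder names
batch_size = 5
batches = [samples[i:i + batch_size] for i in range(0, len(samples), batch_size)]

# one task per pair of batches (a batch paired with itself included)
futures = [score_block.remote(a, b)
           for a, b in itertools.combinations_with_replacement(batches, 2)]
blocks = ray.get(futures)
```

Each task only reads the columns it needs from the parquet file (so every worker would need access to that file, e.g. over a shared filesystem), and each returned block is small, which should keep memory bounded while the batch pairs are processed in parallel.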
Thanks for your help
Note 1: input data with 10 samples can be emulated like this:

```
import numpy as np

# 30,000 features x 10 samples of random 0/1 values
foo = np.random.randint(0, 2, size=(30000, 10))
```
Note 2: I tried scipy.spatial.distance on a single node but ran out of memory, which is why I would like to split the computation across several nodes.