I have a huge distance matrix of size around 590000 * 590000 (the data type of each element is float16). Will it fit in memory for a clustering algorithm? If not, could anyone give an idea of how to use it with the DBSCAN clustering algorithm?
- How come you "have" this matrix but don't know its size and whether it fits into memory? – Has QUIT--Anony-Mousse Jun 30 '19 at 06:58
3 Answers
590000 * 590000 * 2 bytes (float16 size) = 696.2 GB of RAM
It won't fit in memory on a standard computer. Moreover, float16 values are converted to float32 in order to perform computations (see Python numpy float16 datatype operations, and float8?), so it might use a lot more than 700 GB of RAM.
Why do you have a square matrix? Can't you use a condensed matrix? It would need only half the memory of the full square matrix.
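For reference, the same arithmetic as a quick script (a minimal sketch that only reproduces the numbers above; "condensed" here means the upper-triangle layout that e.g. scipy's pdist produces):

```python
# Rough size estimate; n is taken from the question, the element size from float16.
n = 590_000
bytes_per_element = 2  # float16

square = n * n * bytes_per_element                # full n x n matrix
condensed = n * (n - 1) // 2 * bytes_per_element  # upper triangle only

print(f"square:    {square / 1e9:,.1f} GB")       # ~696.2 GB
print(f"condensed: {condensed / 1e9:,.1f} GB")    # ~348.1 GB
```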

One way to decrease the problem size for DBSCAN is to split the data into chunks (areas) with overlapping regions.
The size of the overlapping regions has to fit your problem; find a reasonable size for both the chunks and the overlap.
Afterwards, stitch the results together manually by iterating over the chunks and comparing the clusters found in the overlapping regions: check whether the elements of one cluster are also present in clusters of other chunks.
You might have to apply some stitching parameters, e.g. if a certain number of elements appear in clusters of two different chunks, treat them as the same cluster (a rough sketch of this follows below).
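Here is a rough sketch of how such chunking and stitching could look, assuming the raw coordinates are available so that each chunk runs an ordinary DBSCAN instead of slicing the giant matrix. The slab-wise split along the first coordinate, the overlap width of 3·eps and the stitching rule "a point shared by two chunks merges their clusters" are illustrative choices, not part of this answer:

```python
import numpy as np
from sklearn.cluster import DBSCAN

def chunked_dbscan(X, eps=0.3, min_samples=5, n_chunks=4, overlap=None):
    """Run DBSCAN on overlapping slabs along the first coordinate and
    stitch clusters that share points in the overlap regions.
    Illustrative only: points near chunk borders can still be mislabeled."""
    if overlap is None:
        overlap = 3 * eps                      # overlap must exceed eps to be useful
    edges = np.linspace(X[:, 0].min(), X[:, 0].max(), n_chunks + 1)

    labels = -np.ones(len(X), dtype=np.int64)  # -1 = noise, as in scikit-learn
    parent = {}                                # union-find over global cluster ids

    def find(a):
        while parent[a] != a:
            parent[a] = parent[parent[a]]      # path halving
            a = parent[a]
        return a

    def union(a, b):
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[rb] = ra

    next_id = 0
    for i in range(n_chunks):
        left = edges[i] - (overlap if i > 0 else 0)
        right = edges[i + 1] + (overlap if i < n_chunks - 1 else 0)
        idx = np.flatnonzero((X[:, 0] >= left) & (X[:, 0] <= right))
        if len(idx) == 0:
            continue
        local = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(X[idx])
        for lab in set(local.tolist()) - {-1}:
            parent[next_id + lab] = next_id + lab
        for j, lab in zip(idx, local):
            if lab == -1:
                continue
            gid = next_id + lab
            if labels[j] == -1:
                labels[j] = gid
            else:
                union(int(labels[j]), gid)     # same point in two chunks -> same cluster
        next_id += int(local.max()) + 1 if local.max() >= 0 else 0

    # Flatten the union-find so stitched clusters share one label.
    for j in range(len(labels)):
        if labels[j] != -1:
            labels[j] = find(int(labels[j]))
    return labels
```

Expect some mislabeling near chunk borders (see the comment below about core points); a second pass with chunks re-centered on the clusters that were found can reduce that.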
I just saw this:
The problem apparently is a non-standard DBSCAN implementation in scikit-learn. DBSCAN does not need a distance matrix.
But this has probably been fixed years ago.
Which implementation are you using?
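For comparison, this is a minimal call on raw feature vectors (the array here is random dummy data), which is how the scikit-learn implementation is normally used and which does not require you to build a distance matrix yourself:

```python
import numpy as np
from sklearn.cluster import DBSCAN

X = np.random.rand(10_000, 8)  # dummy feature vectors, not a distance matrix
labels = DBSCAN(eps=0.1, min_samples=5, metric="euclidean").fit_predict(X)
```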

- But if I do so, the concept of a core point in DBSCAN will be lost. For example, a point may be a non-core point in chunk 1, but what if it is a core point when you consider the whole dataset? – Vas Jun 24 '19 at 13:37
- You can and probably have to optimize this approach, e.g. by making a second iteration where you center the chunks on the clusters found. You could also dynamically grow a chunk until no more new points are added when increasing its size. – Joe Jun 24 '19 at 13:45
DBSCAN only needs the neighbors of each point.
So if you knew the appropriate parameters (which I doubt), you could read the huge matrix one row at a time and build a list of each point's neighbors within your distance threshold. Assuming that fewer than 1% of the points are neighbors (on such huge data, you'll likely want to go even lower), that would reduce the memory requirement roughly 100x.
But usually you want to avoid computing such a matrix at all!
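A sketch of that row-by-row approach, assuming eps is already known, the matrix sits on disk as a raw float16 file (the filename is made up), and a scikit-learn version that accepts a sparse precomputed distance matrix (only stored entries are treated as possible neighbors):

```python
import numpy as np
from scipy import sparse
from sklearn.cluster import DBSCAN

n = 590_000
eps = 0.1  # assumed to be known already

# Memory-map the matrix so only one row (~1.2 MB as float16) is read at a time.
dist = np.memmap("distances.f16", dtype=np.float16, mode="r", shape=(n, n))

rows, cols, vals = [], [], []
for i in range(n):
    row = dist[i].astype(np.float32)   # float16 rows are upcast for the comparison
    (j,) = np.nonzero(row <= eps)      # keep only neighbors within eps
    rows.append(np.full(len(j), i, dtype=np.int32))
    cols.append(j.astype(np.int32))
    vals.append(row[j])

# Sparse n x n graph that stores only the distances within eps.
graph = sparse.csr_matrix(
    (np.concatenate(vals), (np.concatenate(rows), np.concatenate(cols))),
    shape=(n, n),
)

# DBSCAN with metric="precomputed" accepts such a sparse distance matrix;
# entries that are not stored are treated as "not neighbors".
labels = DBSCAN(eps=eps, min_samples=10, metric="precomputed").fit_predict(graph)
```

Even at 1% density the sparse graph is still tens of gigabytes, which is why the last point stands: if the original feature vectors are available, passing them straight to DBSCAN with a suitable metric avoids materializing any distance matrix.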
