I use cosine_similarity on matrices and wondered about the necessary memory. So I created a small snippet:

import numpy as np
from sklearn.metrics.pairwise import cosine_similarity
n = 10000
mat = np.random.random((n, n))
sim = cosine_similarity(mat)

With n growing, of course the matrix gets way bigger. I expect the size of the matrix to be n**2 * 4 bytes, meaning:

  • n = 10,000: 400MB
  • n = 15,000: 900MB
  • n = 20,000: 1.6GB
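
One way to check this estimate is to ask NumPy what it actually allocated, via `dtype.itemsize` and `nbytes`. This is only a minimal sketch: it uses a small n so that it runs quickly, and the helper name `expected_bytes` is made up for illustration.

```python
import numpy as np


def expected_bytes(n, itemsize):
    """Bytes needed for a dense n x n array with the given element size."""
    return n * n * itemsize


n = 1000  # small on purpose so the check runs quickly
mat = np.random.random((n, n))

# What NumPy reports for the array it actually created
print("dtype:", mat.dtype, "itemsize:", mat.dtype.itemsize, "bytes")

# The estimate above assumes 4 bytes per element
print("assumed :", expected_bytes(n, 4) / 1e6, "MB")

# The real footprint of the array data, in MB (10**6) and MiB (2**20)
print("reported:", mat.nbytes / 1e6, "MB =", mat.nbytes / 2**20, "MiB")
```

Comparing the "assumed" and "reported" lines shows whether the 4-byte assumption holds for the array that `np.random.random` returns. Note that memory-profiler reports MiB, not MB.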

I observe WAY more memory usage. My system has 16GB and it crashes for n = 20,000. Why is that the case?

What I tried

I have seen "How do I profile memory usage in Python?", so I installed memory-profiler and executed

import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

@profile
def cos(n):
    mat = np.random.random((n, n))
    sim = cosine_similarity(mat)
    return sim

sim = cos(n=10000)

with

python3 -m memory_profiler memory_usage_cosine_similarity.py

and got

Line #    Mem usage    Increment   Line Contents
================================================
     4   62.301 MiB   62.301 MiB   @profile
     5                             def cos(n):
     6  825.367 MiB  763.066 MiB       mat = np.random.random((n, n))
     7 1611.922 MiB  786.555 MiB       sim = cosine_similarity(mat)
     8 1611.922 MiB    0.000 MiB       return sim

but I'm confused here about most things:

  • Why is @profile at 62.301 MiB (so big)?
  • Why is mat 825 MiB instead of 400 MB?
  • Why is sim a different size than mat?
  • Why is htop showing me an increase from 3.1 GB to 5.5 GB (2.4 GB), while the profiler says it needs only 1.6 GB?

Setting those questions aside, this is how the memory increment of the cosine_similarity line grows when I increase n:

n                cosine_similarity
1000              11.684 MiB
2000 (x2)         37.547 MiB (x 3.2)
4000 (x4)        134.027 MiB (x11.5)
8000 (x8)        508.316 MiB (x43.5)

So cosine_similarity shows roughly O(n**1.8) behaviour.
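
The exponent can be estimated from the table above with a least-squares line fit in log-log space. This is a minimal sketch using only the standard library; the function name `fit_exponent` is made up for illustration, and four data points give only a rough estimate.

```python
import math

# (n, memory increment in MiB), copied from the n x n table above
measurements = [
    (1000, 11.684),
    (2000, 37.547),
    (4000, 134.027),
    (8000, 508.316),
]


def fit_exponent(points):
    """Slope of the least-squares line through (log n, log memory).

    If memory ~ c * n**k, then log(memory) = log(c) + k * log(n),
    so the slope of that line is the exponent k.
    """
    xs = [math.log(n) for n, _ in points]
    ys = [math.log(mem) for _, mem in points]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx


exponent = fit_exponent(measurements)
print(f"estimated exponent: {exponent:.2f}")
```

The step-to-step growth factors in the table increase with n (3.2, then about 3.6, then about 3.8), so the fitted exponent probably understates the behaviour for large n.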

If I don't use n x n matrices, but n x 100 instead, I get similar numbers:

n                cosine_similarity
1000               9.512 MiB
2000 (x2)         33.543 MiB (x 3.5)
4000 (x4)        127.152 MiB (x13.4)
8000 (x8)        496.234 MiB (x52.2)
Martin Thoma