I use `cosine_similarity` on matrices and wondered how much memory this needs, so I created a small snippet:
```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

n = 10000
mat = np.random.random((n, n))
sim = cosine_similarity(mat)
```
As `n` grows, the matrix of course gets way bigger. I expect the matrix to take `n**2 * 4` bytes (assuming 4 bytes per element), meaning:
- n = 10,000: 400MB
- n = 15,000: 900MB
- n = 20,000: 1.6GB
I observe WAY more memory usage. My system has 16GB and it crashes for n = 20,000. Why is that the case?
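For reference, the per-element size does not have to be estimated; NumPy reports it directly. A minimal check of the input array only, before `cosine_similarity` gets involved:

```python
import numpy as np

n = 10000
mat = np.random.random((n, n))

# NumPy itself tracks the element type and the exact size of the array's
# data buffer, so the estimate above can be cross-checked directly.
print(mat.dtype)           # element type produced by np.random.random
print(mat.itemsize)        # bytes per element for that dtype
print(mat.nbytes / 1e6)    # size of the data buffer in MB
```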
## What I tried
I have seen *How do I profile memory usage in Python?*, so I installed `memory-profiler` and executed
```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

@profile
def cos(n):
    mat = np.random.random((n, n))
    sim = cosine_similarity(mat)
    return sim

sim = cos(n=10000)
```
with

```
python3 -m memory_profiler memory_usage_cosine_similarity.py
```

and got:
```
Line #    Mem usage    Increment   Line Contents
================================================
     4   62.301 MiB   62.301 MiB   @profile
     5                             def cos(n):
     6  825.367 MiB  763.066 MiB       mat = np.random.random((n, n))
     7 1611.922 MiB  786.555 MiB       sim = cosine_similarity(mat)
     8 1611.922 MiB    0.000 MiB       return sim
```
but I'm confused here about most things:

- Why does `@profile` already start at 62.301 MiB (so big)?
- Why is `mat` 825 MiB instead of 400 MB?
- Why is `sim` a different size than `mat`?
- Why does `htop` show an increase from 3.1 to 5.5 GB (2.4 GB), while the profiler says it needs only 1.6 GB? (See the peak-memory sketch after this list.)
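One way to look at the last point is to compare the peak during the call with what is left afterwards. A sketch using the standard library's `tracemalloc` (assuming a reasonably recent NumPy, which registers its array buffers with `tracemalloc`; memory allocated outside Python's allocator, e.g. inside BLAS, is not counted):

```python
import tracemalloc

import numpy as np
from sklearn.metrics.pairwise import cosine_similarity


def cos(n):
    mat = np.random.random((n, n))
    sim = cosine_similarity(mat)
    return sim


tracemalloc.start()
sim = cos(n=10000)
current, peak = tracemalloc.get_traced_memory()
tracemalloc.stop()

# `current` is what is still allocated after the call (essentially sim),
# `peak` is the high-water mark while cos() was running, including
# temporaries that per-line increments cannot show because they are freed
# before the line finishes.
print(current / 1e6, "MB held now,", peak / 1e6, "MB at peak")
```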
Setting that aside, this is what happens when I increase `n`:
| n         | `cosine_similarity` |
|-----------|---------------------|
| 1000      | 11.684 MiB          |
| 2000 (x2) | 37.547 MiB (x3.2)   |
| 4000 (x4) | 134.027 MiB (x11.5) |
| 8000 (x8) | 508.316 MiB (x43.5) |
So `cosine_similarity` shows roughly O(n**1.8) behaviour.
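A table like this can be gathered with a small loop. A sketch using `memory_usage` from memory-profiler (not necessarily the exact script behind the numbers above; the figures depend on the sampling interval and library versions):

```python
import numpy as np
from memory_profiler import memory_usage
from sklearn.metrics.pairwise import cosine_similarity


def cos(n):
    mat = np.random.random((n, n))
    sim = cosine_similarity(mat)
    return sim


for n in [1000, 2000, 4000, 8000]:
    # memory_usage runs cos(n) and samples the process's RSS (the same
    # figure htop reports) every `interval` seconds; the maximum of those
    # samples approximates the peak during the call.
    samples = memory_usage((cos, (n,)), interval=0.05)
    print(n, round(max(samples), 3), "MiB")
```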
If I don't use `n x n` matrices, but `n x 100` instead, I get similar numbers:
| n         | `cosine_similarity` |
|-----------|---------------------|
| 1000      | 9.512 MiB           |
| 2000 (x2) | 33.543 MiB (x3.5)   |
| 4000 (x4) | 127.152 MiB (x13.4) |
| 8000 (x8) | 496.234 MiB (x52.2) |
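For completeness, this is the kind of `n x 100` variant I mean. A sketch; note that the output of `cosine_similarity` is one similarity per pair of rows, so it is still an `n x n` matrix regardless of how many columns the input has:

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

n = 8000
mat = np.random.random((n, 100))   # n samples with 100 features each
sim = cosine_similarity(mat)       # pairwise similarities between the rows

# The result has one row and one column per sample, so its shape is (n, n)
# no matter how wide the input is.
print(mat.shape, sim.shape)
```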