
I tried agglomerative clustering in a Jupyter notebook. The shape of my dataset is (406829, 8).

I tried the following code:

import pandas as pd
import numpy as np
import matplotlib
from matplotlib import pyplot as plt
import os
from sklearn.preprocessing import StandardScaler, LabelEncoder
import scipy.cluster.hierarchy as shc
from sklearn.cluster import AgglomerativeClustering


# Apply the agglomerative clustering with ward linkage
aggloclust = AgglomerativeClustering(affinity='euclidean', linkage='ward', memory=None, n_clusters=5).fit(data)
print(aggloclust)

# Agglomerative clustering labels
labels = aggloclust.labels_

# Show the clusters on the graph
plt.scatter(data.iloc[:, 0], data.iloc[:, 1], c=labels)
plt.show()

Then I ran into this error:

MemoryError: Unable to allocate 617. GiB for an array with shape (82754714206,) and data type float64

I am working on a Windows machine with 16 GB RAM, Python version 3.8.5. Can anyone tell me how I can resolve this issue?

I googled this error and found a suggested fix: create the Jupyter config file and then update max_buffer_size in it. I found it here: How to increase Jupyter notebook Memory limit?

I tried the solution provided in the above link but it did not work. Please help me.

    You cannot magically get 617 GB of memory by increasing a number in a configuration file if you have only 16 GB physical RAM. – mkrieger1 May 30 '21 at 14:28
  • Where does the number 82754714206 come from? – mkrieger1 May 30 '21 at 14:32
    @mkrieger1 it's `sum(range(406829))`: the number of entries in the condensed distance matrix for the OP's `406829` samples. It is `sum(range(406829))` rather than `406829 ** 2` because the symmetric duplicate distances (the distance between two points is the same in both directions) are not stored, so only roughly half of the matrix is needed. – hnwoh Aug 01 '23 at 14:15

1 Answer


Memory consumption of AgglomerativeClustering is O(n²), i.e. it grows quadratically with the number of samples. With single linkage the computation time can be reduced from O(n³) to O(n²), but unfortunately this does not apply to memory [1]. Single linkage also has the downside of a "rich get richer" kind of behaviour, where you tend to end up with a few big clusters and many clusters of near-zero size [2]. So, at least within SciPy and scikit-learn, the options for tuning this are limited.
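
To get a feel for that quadratic growth, here is a rough back-of-the-envelope estimate (only a sketch; the sole assumption is a condensed float64 distance matrix at 8 bytes per entry):

# Approximate size of the condensed pairwise-distance matrix for n rows
def distance_matrix_gib(n):
    return n * (n - 1) // 2 * 8 / 2**30   # n*(n-1)/2 float64 entries, 8 bytes each

for n in (406829, 203415, 40683):         # full data, ~50% sample, ~10% sample
    print(n, "rows ->", round(distance_matrix_gib(n), 1), "GiB")
# 406829 rows -> 616.6 GiB   (the ~617 GiB from the error message)
# 203415 rows -> 154.1 GiB
# 40683 rows -> 6.2 GiB

This also shows why raising Jupyter's buffer limit cannot help: the array simply does not fit in 16 GB of RAM.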

Another option would be to use less input data when fitting the model (i.e. training on a subsample). Assuming the data object is a DataFrame, you could use:

data.sample(frac=0.5)

which keeps only half of the rows; since memory grows quadratically, this cuts the distance-matrix memory to roughly a quarter. Don't use a lot of data at first. From [3]:

I ran the algorithm on 0.02% of my data and I got the result but the problem raised when I need to label all records.
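
One possible way around that last problem (again only a sketch, not part of the answer above): fit the model on a sample, then assign every remaining row to the cluster of its nearest sampled point, e.g. with scikit-learn's KNeighborsClassifier. It assumes `data` is the numeric (406829, 8) DataFrame from the question, and the sample fraction is purely illustrative:

from sklearn.cluster import AgglomerativeClustering
from sklearn.neighbors import KNeighborsClassifier

# Cluster only a small sample (~20k rows, so the condensed distance
# matrix needs ~1.5 GiB instead of ~617 GiB)
sample = data.sample(frac=0.05, random_state=0)
agg = AgglomerativeClustering(n_clusters=5, linkage='ward').fit(sample)

# Propagate the sample's cluster labels to all rows via the nearest sampled point
knn = KNeighborsClassifier(n_neighbors=1).fit(sample, agg.labels_)
all_labels = knn.predict(data)   # labels for all 406829 rows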

Sources:

[1] http://wwwens.aero.jussieu.fr/lefrere/master/SPE/docs-python/scipy-doc/generated/scipy.cluster.hierarchy.linkage.html

[2] https://scikit-learn.org/stable/modules/clustering.html

[3] https://datascience.stackexchange.com/questions/47889/how-to-run-agglomerativeclustering-on-a-big-data-in-python
