
I am trying to cluster my dataset in Python, but I am getting a MemoryError. The dataset has 442,000 rows and 12 columns (all float). The code is below:

import psycopg2 as pg
import dask.dataframe as dd
from sklearn.cluster import AgglomerativeClustering

# Connect to Postgres and load the table lazily as a Dask dataframe
connection = pg.connect("host='...' dbname=... user=... password='...' port='...'")
df = dd.read_sql_table('table', 'postgresql+psycopg2://...', index_col='id', schema='...')

# Hierarchical clustering with L1 distance and average linkage
hc = AgglomerativeClustering(n_clusters=20, affinity='l1', linkage='average')
labels = hc.fit_predict(df)

Here is the error:

MemoryError: Unable to allocate 729. GiB for an array with shape (97884319653,) and data type float64
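
For what it is worth, the array size in the traceback looks like the condensed pairwise-distance matrix that gets built for average linkage: n * (n - 1) / 2 entries for n ≈ 442,458 rows, which at 8 bytes per float64 comes to about 729 GiB:

# Back-of-the-envelope check; n is read off the shape in the traceback
n = 442_458
entries = n * (n - 1) // 2
print(entries)               # 97884319653, matches the traceback
print(entries * 8 / 2**30)   # ~729.3 GiB of float64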

I tried different values for parameters such as npartitions and bytes_per_chunk. I also tried the accepted answer from this question: Unable to allocate array with shape and data type, but that did not seem to work in my Anaconda environment. The program appears to run without a problem, but when I try to access labels it throws: NameError: name 'labels' is not defined
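
In case it matters, this is roughly how I passed those parameters (the values below are only placeholders, not the exact numbers I used):

# Placeholder values -- npartitions fixes the number of partitions, while
# bytes_per_chunk targets an approximate partition size instead; I tried
# these (and other values) one at a time, not together.
df = dd.read_sql_table('table', 'postgresql+psycopg2://...', index_col='id', schema='...',
                       npartitions=100)

df = dd.read_sql_table('table', 'postgresql+psycopg2://...', index_col='id', schema='...',
                       bytes_per_chunk=64 * 2**20)  # ~64 MiB per partition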

OS: Linux Mint, RAM: 16 GB

mcsahin
    Some troubleshooting suggestions: double check that the size of the data is what you expect it to be, take a small subset of the data and see how much memory the clustering takes up, take larger and larger subsets and see how big you can make it before running into an error again. – Acccumulation Apr 06 '21 at 15:30
  • I will try your suggestion and share the result as soon as possible. – mcsahin Apr 06 '21 at 15:32
  • I can select 39,000 rows from the dataset. If I select more, it throws this error: MemoryError: unable to allocate array data. (A rough sketch of what I ran is below the comments.) – mcsahin Apr 07 '21 at 06:45
  • I found this [RAPIDS cuml implementation of Agglomerative Clustering](https://docs.rapids.ai/api/cuml/stable/api.html#agglomerative-clustering) that might help here by leveraging parallelism. – pavithraes Oct 05 '21 at 06:20
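
Roughly what I ran for the subset test mentioned in my comment above (the subset sizes are placeholders; I pulled the Dask dataframe into memory first):

# Rough sketch of the subset test: cluster progressively larger random
# subsets until a MemoryError shows up (placeholder sizes; assumes the
# table fits in memory once pulled out of Dask with .compute()).
import numpy as np

X = df.compute().to_numpy()
rng = np.random.default_rng(0)

for n in (10_000, 20_000, 39_000, 50_000):
    sample = X[rng.choice(len(X), size=n, replace=False)]
    hc = AgglomerativeClustering(n_clusters=20, affinity='l1', linkage='average')
    try:
        hc.fit_predict(sample)
        print(n, 'rows: ok')
    except MemoryError as err:
        print(n, 'rows:', err)
        break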

0 Answers