I am trying to cluster my dataset in Python, but I am getting a MemoryError. The dataset has about 442,000 rows and 12 columns (all float). The code is below:
import psycopg2 as pg
import dask.dataframe as dd
from sklearn.cluster import AgglomerativeClustering

# Note: this psycopg2 connection is never actually used below;
# dd.read_sql_table opens its own connection via the SQLAlchemy URI.
connection = pg.connect("host='...' dbname=... user=... password='...' port = '...'")

# Load the table lazily as a Dask dataframe, indexed by the 'id' column
df = dd.read_sql_table('table', 'postgresql+psycopg2://...', index_col='id', schema='...')

# Average-linkage agglomerative clustering with the L1 (Manhattan) metric
hc = AgglomerativeClustering(n_clusters=20, affinity='l1', linkage='average')
labels = hc.fit_predict(df)
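For context: as far as I understand, scikit-learn is not Dask-aware, so fit_predict pulls the whole table into memory anyway. Materializing it explicitly shows the raw data itself is small (a sketch, assuming the same df as above):

arr = df.compute().to_numpy()   # collect all partitions into one in-memory NumPy array
print(arr.shape)                # roughly (442000, 12)
print(arr.nbytes / 2**20)       # ~40 MiB of float64s, so the data itself fits easily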
Here is the error:
MemoryError: Unable to allocate 729. GiB for an array with shape (97884319653,) and data type float64
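If my arithmetic is right, the shape in the error is exactly n*(n-1)/2 for n = 442,458 (which matches my ~442,000 rows), i.e. the condensed pairwise distance matrix that average-linkage clustering builds, so the huge allocation comes from the clustering step itself, not from loading the data:

n = 442_458                 # the n for which n*(n-1)/2 equals the shape in the error
pairs = n * (n - 1) // 2
print(pairs)                # 97884319653 -- the shape NumPy tried to allocate
print(pairs * 8 / 2**30)    # ~729.3 GiB of float64 pairwise distances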
I tried different values for parameters like npartitions and bytes_per_chunk. I also tried the accepted answer to this question: Unable to allocate array with shape and data type, but the setting does not seem to take effect in my Anaconda environment (see the check after the traceback below). The program then appears to run without any problem, but when I try to access labels, it throws:
NameError: name 'labels' is not defined
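If I remember correctly, the accepted answer there suggests enabling the kernel's memory overcommit (vm.overcommit_memory = 1). A minimal check to see whether the setting actually applied (assuming the /proc path from that answer):

# Read the current overcommit mode; the linked answer sets it to 1 ("always overcommit").
# 0 is the Linux default (heuristic overcommit).
with open('/proc/sys/vm/overcommit_memory') as f:
    print(f.read().strip())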
OS: Linux Mint, RAM: 16 GB