
The following code tests KMeans for several values of n_clusters and tries to find the "best" n_clusters by the inertia criterion. However, it is not reproducible: even with random_state fixed, every time I call kmeans(df) on the same dataset it generates a different clustering, and even a different n_clusters. Am I missing something here?

import numpy as np
from sklearn.cluster import KMeans
from tqdm import tqdm_notebook

def kmeans(df):
    inertia = []
    models = {}
    start = 3
    end = 40
    for i in tqdm_notebook(range(start, end)):
        k = KMeans(n_clusters=i, init='k-means++', n_init=50, random_state=10, n_jobs=-1).fit(df.values)
        inertia.append(k.inertia_)
        models[i] = k
    ep = np.argmax(np.gradient(np.gradient(np.array(inertia)))) + start
    return models[ep]
Celso
  • Related: https://stackoverflow.com/questions/25921762/changes-of-clustering-results-after-each-time-run-in-python-scikit-learn – PV8 Nov 19 '19 at 09:11
  • It is not always the same; that's the nature of this algorithm. If your results vary a lot, it probably means your data are not clusterable. – PV8 Nov 19 '19 at 09:12
  • `gradient(gradient(...))` is a poor and unreliable way of implementing the already poor and unreliable elbow criterion. Don't do this. In particular, not without double-checking your results. – Has QUIT--Anony-Mousse Nov 20 '19 at 19:22
  • @PV8 he has set `random_state` so it *should* be deterministic. The error is probably somewhere else, such as the data set preparation. – Has QUIT--Anony-Mousse Nov 20 '19 at 19:24
  • I'm not sure the error is in the data set preparation, as suggested by @Anony-Mousse. As I said in my question, "every time I call kmeans(df) on the same dataset (...)". – Celso Nov 23 '19 at 01:16
  • Well, the only source of randomness here has been fixed. There is nothing we can do for you if it's not in the code shown. – Has QUIT--Anony-Mousse Nov 23 '19 at 09:02
  • Does n_jobs=1 help maybe? Then you've got a race condition in sklearn that you should report *there*. – Has QUIT--Anony-Mousse Nov 23 '19 at 09:03
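
To act on the comments above, here is a minimal sanity check (the helper name is my own invention): two fits with identical arguments and a fixed random_state should produce identical labels. If this check passes on df.values, the nondeterminism comes from somewhere outside KMeans, such as the data preparation.

import numpy as np
from sklearn.cluster import KMeans

def check_kmeans_determinism(X, n_clusters=5):
    # Fit twice with the same parameters and the same fixed seed
    a = KMeans(n_clusters=n_clusters, init='k-means++', n_init=50, random_state=10).fit_predict(X)
    b = KMeans(n_clusters=n_clusters, init='k-means++', n_init=50, random_state=10).fit_predict(X)
    # With random_state fixed, the two label vectors should be identical
    return np.array_equal(a, b)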

1 Answer


I am having this same issue. I think the cleaner solution is to freeze the model into a file, load it back, and then predict the cluster for a new phrase. If the vectorizer and the k-means clusterer are re-initialized on every run of the program, the clusters come out in a different order each time, so the number-to-name map no longer lines up and the function returns a different number every time it is called.

import json
import sqlite3

import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans
from sklearn.utils import shuffle

# Load the sample sentences and shuffle them
df = pd.read_csv('/workspaces/codespaces-flask/data/shuffled.csv')
df = shuffle(df)
sentences = df['text'].values
# Convert the sentences into TF-IDF features
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(df['text'])

# Perform K-Means clustering
kmeans = KMeans(n_clusters=8, random_state=42)
clusters = kmeans.fit_predict(X)
output = zip(sentences, clusters)
# Print the cluster assignments for each sentence
for sentence, cluster in zip(sentences, clusters):
    print("Sentence:", sentence, "Cluster:", cluster)
df = pd.DataFrame(output)

db_file_name = '/workspaces/codespaces-flask/ThrAive/data/database1.db'
conn = sqlite3.connect(db_file_name)
cursor = conn.cursor()
cursor.execute("SELECT journal_text FROM Journal JOIN User ON Journal.id = User.id")
rows = cursor.fetchall()
conn.close()

df1 = pd.DataFrame(rows)
df1 = df1.applymap(lambda x: " ".join(x.split()) if isinstance(x, str) else x)
entry = df1
print(entry)
# Take the most recent journal entry and lowercase it
entry = entry[0].iloc[-1].lower()
entry = [entry]
new_X = vectorizer.transform(entry)

# Predict the cluster assignments for the new sentences
new_clusters = kmeans.predict(new_X)
for sentence, new_cluster in zip(entry, new_clusters):
    print("Sentence:", sentence, "Cluster:", new_cluster)
zipper = zip(entry, new_clusters)
df = pd.DataFrame(zipper)
df = df.applymap(lambda x: " ".join(x.split()) if isinstance(x, str) else x)
df = df.to_string(header=False, index=False)
output = df
numbers = ['0', '1', '2', '3', '4', '5', '6', '7', '8']
names = []  # the list of cluster names was left blank in the original post
# Create a dictionary that maps numbers to names
number_to_name = {number: name for number, name in zip(numbers, names)}
print(output[-1])
output = number_to_name[output[-1]]


json_string = json.dumps(str(output))

I think the solution is saving the model to disk:

import pickle

# Train a scikit-learn model (placeholder in the original post)
model = ...

# Save the model to disk
with open('model.pkl', 'wb') as file:
    pickle.dump(model, file)

and then load the pickle file and predict with the saved k-means model, without re-initializing the clusterer.
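
A minimal sketch of that loading step, assuming the fitted TfidfVectorizer was pickled alongside the KMeans model (the vectorizer.pkl file name is my assumption; the code above only saves model.pkl):

import pickle

# Load the fitted vectorizer and k-means model saved earlier
# (vectorizer.pkl is assumed; pickle the fitted vectorizer the same way as the model)
with open('vectorizer.pkl', 'rb') as file:
    vectorizer = pickle.load(file)
with open('model.pkl', 'rb') as file:
    kmeans = pickle.load(file)

# Transform the new phrase with the same fitted vectorizer and predict with
# the same fitted model, so cluster numbering stays stable across runs
new_X = vectorizer.transform(["some new journal entry"])
print(kmeans.predict(new_X))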