How does one do k-means clustering on multiple columns in structured data?

In the example below it has been done on one column (name):

tfidf_matrix = tfidf_vectorizer.fit_transform(df_new['name'])

Here only name is used, but say we wanted to use both name and country. Should I be adding country to the same column as follows?

df_new['name'] = df_new['name'] + " " + df_new['country']
tfidf_matrix = tfidf_vectorizer.fit_transform(df_new['name'])

It works from a code perspective, and I am still trying to understand the results (I actually have tons of columns), but I wonder if that is the right way to fit when there is more than one column.

import os
import pandas as pd
import re
import numpy as np

df = pd.read_csv('sample-data.csv')


def split_description(string):
    # keep only the name part (text before the first ' - ')
    name = string.split(' - ', 1)[0]
    return name


df_new = pd.DataFrame()
df_new['name'] = df['description'].apply(split_description)
df_new['id'] = df['id']


def remove(name):
    # strip digits and collapse repeated whitespace
    new_name = re.sub("[0-9]", '', name)
    new_name = ' '.join(new_name.split())
    return new_name

df_new['name'] = df_new['name'].apply(remove)



from sklearn.feature_extraction.text import TfidfVectorizer


tfidf_vectorizer = TfidfVectorizer(
                                   use_idf=True,
                                   stop_words = 'english',
                                   ngram_range=(1,4), min_df = 0.01, max_df = 0.8)


tfidf_matrix = tfidf_vectorizer.fit_transform(df_new['name'])

print(tfidf_matrix.shape)
print(tfidf_vectorizer.get_feature_names_out())  # get_feature_names() in older scikit-learn


from sklearn.metrics.pairwise import cosine_similarity
dist = 1.0 - cosine_similarity(tfidf_matrix)  # pairwise cosine distances (for inspection)
print(dist)


from sklearn.cluster import KMeans
num_clusters = range(1, 20)

# fit k-means for each k (e.g. to pick k via the elbow method)
KM = [KMeans(n_clusters=k, random_state=1).fit(tfidf_matrix) for k in num_clusters]
Naresh MG
  • KMeans works on 2-d data. Have you tried using KMeans on your original dataset (without combining the columns into a single one) and just converting them to numerical columns (like one-hot encoding, or binarizing)? – Vivek Kumar Oct 05 '17 at 08:07
  • Thanks for your comment, I have not tried this out yet, but I have tons of columns. Do you think this is the route to go if I were to end up using some 30+ columns? (Some of which are descriptions, for which encoding would not work.) – Naresh MG Oct 05 '17 at 08:40
  • For columns which have text, tfidf is good; for categorical columns, one-hot encoding will be good. It doesn't matter how many columns you have, unless you have very little data (rows). If the rows are sufficiently numerous, then this is the basic approach. Once you have analysed the data, other advanced feature selection and engineering techniques can be applied. – Vivek Kumar Oct 05 '17 at 09:09
  • I have some 100s of columns and am yet to figure out which ones to use. There are around 5000 rows. I will try it out as per your suggestion, and if I understand right, you are saying I could pass an entire DataFrame to k-means: text columns as such, and the others one-hot encoded. – Naresh MG Oct 05 '17 at 16:37
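The approach suggested in the comments can be sketched as follows. This is a minimal example with made-up data and column names, using `ColumnTransformer` (available in newer scikit-learn versions) to apply tf-idf to the free-text column and one-hot encoding to the categorical one before clustering:

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import OneHotEncoder
from sklearn.cluster import KMeans

# toy frame with a text column and a categorical column (hypothetical values)
df = pd.DataFrame({
    'name': ['active volcano climb', 'beach resort stay',
             'volcano hike tour', 'city beach walk'],
    'country': ['chile', 'spain', 'chile', 'spain'],
})

# tf-idf for the free-text column, one-hot for the categorical one;
# the results are stacked side by side into one feature matrix
preprocess = ColumnTransformer([
    ('name_tfidf', TfidfVectorizer(), 'name'),
    ('country_ohe', OneHotEncoder(), ['country']),
])

X = preprocess.fit_transform(df)  # sparse matrix: tf-idf cols + one-hot cols
labels = KMeans(n_clusters=2, random_state=1, n_init=10).fit_predict(X)
```

Each column keeps its own feature space, so k-means sees them as separate features rather than one mashed-together string.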

1 Answer


No, that is an incorrect way to fit multiple columns. You are simply jamming multiple features together into one string and expecting it to behave as if k-means had been applied to those columns as separate features.

You need to use other methods, like per-column vectorizers and Pipelines along with TfidfVectorizer, to do this on multiple columns. You can check out this link for more information.

Additionally, you can check out this answer for a possible alternate solution to your problem.
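A rough sketch of the Pipeline approach the answer refers to: run a separate vectorizer per column and stack the outputs with `FeatureUnion`. The data and column names here are hypothetical; the column-selecting helper is built with `FunctionTransformer` for brevity:

```python
import pandas as pd
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.preprocessing import FunctionTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

# hypothetical frame with two text columns
df = pd.DataFrame({
    'name': ['mountain trek', 'river cruise', 'mountain cruise'],
    'country': ['nepal', 'egypt', 'chile'],
})

def column(name):
    # select a single DataFrame column as an iterable of strings
    return FunctionTransformer(lambda X: X[name], validate=False)

# one TfidfVectorizer per column; FeatureUnion concatenates the matrices
union = FeatureUnion([
    ('name', Pipeline([('select', column('name')),
                       ('tfidf', TfidfVectorizer())])),
    ('country', Pipeline([('select', column('country')),
                          ('tfidf', TfidfVectorizer())])),
])

X = union.fit_transform(df)
labels = KMeans(n_clusters=2, random_state=1, n_init=10).fit_predict(X)
```

Each column gets its own vocabulary and weighting, so terms shared across columns (or across rows in different columns) are not conflated the way they would be after string concatenation.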

Gambit1614