Can someone please help modify the working example below to create clusters from the shared data?

The example uses Mean Shift clustering from Scikit-Learn to identify patches of similar/co-located plant species in an agronomical facility.

Similar questions about using categorical values in addition to the numeric values in these kinds of problems have been asked before, but I think this example is different for the following reason: The non-numeric values in this problem cannot be simply encoded with one and zero dummy values. For example, we can't One-Hot encode values like 'Aristolochia macrophylla' and 'Aristolochia durior' because species with this kind of similarity in their names need to be clustered together based on their family, in addition to their geographic proximity as given by the X and Y values. The similarity of the name is just as important as the location when creating the clusters.

I've tried two things. First, I assigned arbitrary numeric values to the letters in each species name, hoping that names with similar spelling would land close together on a number line. I was going to auto-scale those values and plug them into the script alongside the X and Y coordinates. This doesn't work because different names ended up very similar numerically.
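To illustrate why that first attempt collapses, here is a small sketch. The exact letter scheme is an assumption made for this example (a=1 ... z=26, averaged over the name), and `letter_score` is a hypothetical helper, not part of the script below. Any encoding that squeezes a whole string into one number throws away letter order, so unrelated names can collide:

```python
def letter_score(name):
    """Average alphabet position (a=1 ... z=26) of the ASCII letters in a name.

    Assumed scheme for illustration only; it ignores letter order entirely.
    """
    letters = [c for c in name.lower() if "a" <= c <= "z"]
    return sum(ord(c) - ord("a") + 1 for c in letters) / len(letters)


# Print the score of a few names from the data set for comparison.
for name in ["Cornus alba", "Cornus canadensis", "Lamium maculatum", "Iberis sempervirens"]:
    print(f"{name:22s} {letter_score(name):.2f}")

# Anagrams get exactly the same score, although they are different strings.
print(letter_score("alba") == letter_score("baal"))
```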

My other attempt to incorporate the categorical values used the Levenshtein distance. But that distance is only defined between two strings at a time. And if you build a matrix holding the distance from each string to all the others, how can you use that result as an input for the MeanShift algorithm?
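One direction that might work, shown only as a rough sketch: MeanShift wants feature vectors rather than a distance matrix, so the pairwise Levenshtein matrix could first be turned into coordinates with classical multidimensional scaling (a swapped-in technique, not something MeanShift provides) and then stacked next to the scaled X/Y values. The helper names (`levenshtein`, `classical_mds`, `name_features`, `combined_features`), the `name_weight` parameter, the toy data and the bandwidth are all assumptions made for this example.

```python
import numpy as np
from sklearn.cluster import MeanShift
from sklearn.preprocessing import StandardScaler


def levenshtein(a, b):
    """Edit distance between two strings (row-by-row dynamic programming)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def classical_mds(dist, n_components=2):
    """Coordinates whose Euclidean distances approximate the matrix `dist`."""
    n = dist.shape[0]
    centering = np.eye(n) - np.ones((n, n)) / n
    gram = -0.5 * centering @ (dist ** 2) @ centering
    eigvals, eigvecs = np.linalg.eigh(gram)
    order = np.argsort(eigvals)[::-1][:n_components]
    # Edit distances are not guaranteed Euclidean, so clip negative eigenvalues.
    return eigvecs[:, order] * np.sqrt(np.clip(eigvals[order], 0, None))


def name_features(names, n_components=2):
    """One embedding row per input name; identical names share a row."""
    unique = sorted(set(names))
    dist = np.array([[levenshtein(a, b) for b in unique] for a in unique], dtype=float)
    coords = classical_mds(dist, n_components)
    index = {name: i for i, name in enumerate(unique)}
    return coords[[index[n] for n in names]]


def combined_features(xy, names, name_weight=1.0):
    """Scaled X/Y columns followed by weighted name-embedding columns."""
    xy_scaled = StandardScaler().fit_transform(xy)
    emb = name_features(names)
    spread = emb.std()
    if spread > 0:
        emb = emb / spread  # one global factor keeps relative name distances
    return np.hstack([xy_scaled, name_weight * emb])


# Toy data (made up): two spatial groups with mixed species.
xy = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [5.0, 5.0], [5.1, 5.0], [5.0, 5.1]])
names = ["Aristolochia macrophylla", "Aristolochia durior", "Cornus alba",
         "Aristolochia macrophylla", "Cornus alba", "Cornus alba"]

features = combined_features(xy, names, name_weight=1.0)
labels = MeanShift(bandwidth=1.0).fit(features).labels_
print(labels)
```

Raising `name_weight` makes spelling count for more than location, lowering it does the opposite. The bandwidth would need re-tuning for the real data, for instance with `estimate_bandwidth` on the combined feature array.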

Anyway, here is the data and working script that uses just the numeric values for now. I would really appreciate any examples of how to cluster this data using the similarity of the categorical values as well.

Thank you

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from itertools import cycle
from sklearn.cluster import MeanShift, estimate_bandwidth

df=pd.DataFrame()

df["POINT_X"]=[-75.933169765,-75.932900302,-75.933060039,-75.932456135,-75.932334122,-75.933383845,-75.933378563,-75.933290334,-75.933302506,-75.932024669,-75.931803297,-75.931777655,-75.9317845,-75.931807731,
               -75.931794839,-75.932045113,-75.932165473,-75.932763574,-75.93216276,-75.932066326,-75.931934871,-75.932294115,-75.931852284,-75.93187799,-75.932063549,-75.932377939,-75.932466697,-75.9324484,-75.932523695,
               -75.932484492,-75.931882652,-75.932006344,-75.932228988,-75.932702486,-75.933245229,-75.933165385,-75.932990797,-75.932741398,-75.932519195,-75.932336262,-75.932264764,-75.932953569,-75.932938167,-75.933098289,
               -75.932503985,-75.932597591,-75.932551382,-75.932541384,-75.932575066,-75.932751274,-75.932869969,-75.932086405,-75.932125915,-75.932089623,-75.932229816,-75.932356252,-75.93221234,-75.932505964,-75.932455199,
               -75.932672148,-75.932823439,-75.93266258,-75.932722695,-75.93262497,-75.932613958,-75.932726832,-75.933179618,-75.933413275,-75.932911947,-75.93293013,-75.933129681,-75.933348106,-75.933328068,-75.9333501,
               -75.933133529,-75.93306104,-75.933020824,-75.933056158,-75.933261164,-75.933157803,-75.933320158,-75.93306193,-75.932935915,-75.933125758,-75.933088069,-75.933158642,-75.9331282,-75.933096121,-75.933250109,
               -75.933325084,-75.933336448,-75.934785616,-75.934843128,-75.93387422,-75.933996517,-75.934114484,-75.934560855,-75.935138185,-75.935228902,-75.935550248,-75.935326059,-75.935167468,-75.935038326,-75.934937151,
               -75.934476218,-75.934576771,-75.934556169,-75.934324709,-75.934215059,-75.934185509,-75.933996183,-75.938853557,-75.937435702,-75.93755249,-75.93709863,-75.937584727,-75.937080786,-75.93717527,-75.937158245,
               -75.937153622,-75.937255458,-75.937291351,-75.937463492,-75.937508635,-75.937568922,-75.937604,-75.937643152,-75.937538299,-75.936224493,-75.936538213,-75.936653234,-75.936672687,-75.936781092,-75.936765158,
               -75.936775048,-75.93680606,-75.936808197,-75.936753824,-75.936637658,-75.936923553,-75.936872045,-75.936871187,-75.936735385,-75.936800934,-75.936504657,-75.936528774,-75.936462867,-75.936301988,-75.936248282,
               -75.936192436,-75.935933385,-75.93679036,-75.936984567,-75.937178376,-75.937072594,-75.936212479,-75.937100912,-75.937075027,-75.93703418,-75.936553923,-75.936563813,-75.936750108,-75.935328068,-75.93329076,
               -75.933274837,-75.932816577,-75.932958943,-75.932872736,-75.933039998,-75.932930987,-75.932975423,-75.932987859,-75.932944342,-75.932984985,-75.933102016,-75.933042959,-75.935432474,-75.93539475,-75.935456177,
               -75.935413297,-75.935564812,-75.936518316,-75.935680005,-75.936558194,-75.935736741,-75.935754977,-75.935809,-75.935866569,-75.936134435,-75.936272398,-75.936252114,-75.936497277,-75.936178069,-75.933545359,
               -75.933462287,-75.933528848,-75.933456247,-75.933508043,-75.933443108,-75.933436682,-75.933293086,-75.933458306,-75.932948828,-75.933541322,-75.933719067,-75.933560447,-75.934586709,-75.934531055,-75.93416494,
               -75.933882234,-75.934830229,-75.934978045,-75.934357619,-75.934605828,-75.934754661,-75.934743056,-75.934130125,-75.935928887,-75.936286533,-75.936425628,-75.936477105,-75.935622798,-75.935607342,-75.936576534,
               -75.936823941,-75.936664385,-75.936985859,-75.936927641,-75.937655315,-75.93754798,-75.937409554,-75.937780814,-75.936920843,-75.93724831,-75.937473965,-75.937712006,-75.935331673,-75.936250622,-75.934986449,
               -75.938144151,-75.938287148,-75.938572438,-75.938677207,-75.938737192,-75.936696505,-75.9379094,-75.937601482,-75.931082221,-75.931152233,-75.931929379,-75.931886037,-75.931539305,-75.93145414,-75.931517537,
               -75.93206476,-75.931104594,-75.930886831,-75.930796839,-75.930770692,-75.934395391,-75.933485857,-75.935094793,-75.935243938,-75.934978751,-75.935325475,-75.935361712,-75.933975927,-75.933883586,-75.936299827,
               -75.934936738,-75.935015301,-75.934930658,-75.935287011,-75.935294894,-75.937784172,-75.937770775,-75.938253481,-75.93826076,-75.937784726,-75.93717805,-75.938872368,-75.938875092,-75.939336652,-75.940266037,
               -75.940331239,-75.940421181,-75.940331999,-75.940177713,-75.939332917,-75.938994759,-75.939607395,-75.939598636,-75.939560673,-75.939534037,-75.939555948,-75.939015855,-75.939243491,-75.938789939,-75.933198497,
               -75.93296926,-75.933132717,-75.932772368,-75.932419051,-75.93293841,-75.932798596,-75.932208745,-75.93206523,-75.931983351,-75.932410373,-75.931891975,-75.931568921,-75.931771254,-75.932397243,-75.931396196,
               -75.931519619,-75.932093909,-75.931942073,-75.934429867,-75.934438719,-75.93453334,-75.934266886,-75.934183909,-75.93452075,-75.933856314,-75.933881074,-75.933901224,-75.933751983,-75.933594864,-75.93358154,
               -75.93347677,-75.933895768,-75.933917682,-75.933687372,-75.933927415,-75.933739282,-75.933891053,-75.933712267,-75.93361711,-75.933901067,-75.934161321,-75.934305249,-75.934239461,-75.934211658,-75.933980238,
               -75.934018133,-75.93397582,-75.933918536,-75.933971179,-75.933877169]

df["POINT_Y"]=[38.95259201,38.952468493,38.952585964,38.952220643,38.952172451,38.952978948,38.952611101,38.952620123,38.952527583,38.952013642,38.951971095,38.951950598,38.951878617,38.951867573,38.952051039,38.952319899,
               38.952751776,38.952261808,38.951645828,38.951591344,38.951583443,38.951660428,38.951750197,38.951752666,38.951776696,38.951792968,38.951787078,38.951862848,38.951800999,38.951744805,38.951870508,38.951889649,
               38.951936158,38.95170948,38.951751749,38.951735386,38.951742727,38.951588575,38.951528477,38.951520106,38.951519453,38.951936698,38.952010261,38.952013956,38.952102079,38.952165877,38.952146088,38.952089106,
               38.952117254,38.952151545,38.949969545,38.951201998,38.951159228,38.951123753,38.950778391,38.950531943,38.950989092,38.950097211,38.950208568,38.950065183,38.950071356,38.949923603,38.9498474,38.949809668,
               38.949757376,38.949571133,38.951447294,38.95147755,38.950581745,38.950733667,38.951069352,38.951237478,38.95107276,38.95096753,38.9508122,38.950734862,38.950688169,38.950514372,38.950075351,38.950010511,38.949960875,
               38.949992064,38.95007398,38.950101272,38.950295815,38.950227769,38.950211517,38.950441255,38.950335632,38.95024686,38.950307666,38.950528546,38.950513096,38.950187972,38.950217841,38.950263645,38.950510523,
               38.950755399,38.950708302,38.950286311,38.950229957,38.950164615,38.950045229,38.949970825,38.949877169,38.949993101,38.949660647,38.949543522,38.949625589,38.949412861,38.949487811,38.949880172,38.951839048,
               38.952063455,38.949880835,38.951913953,38.949897842,38.949754481,38.949913573,38.951052934,38.951134326,38.951215119,38.951281057,38.951294341,38.951397886,38.951533389,38.951672146,38.949658462,38.950068808,
               38.949883166,38.949852263,38.949919533,38.950057898,38.950028999,38.950188832,38.950304129,38.950435138,38.950514515,38.950622084,38.950381874,38.949994828,38.950052327,38.949830647,38.949824853,38.949732702,
               38.949761675,38.949791427,38.949879419,38.949914074,38.949955099,38.951691376,38.951766177,38.951785811,38.951832242,38.951733008,38.950873805,38.951440038,38.951405074,38.951254936,38.951212584,38.951201821,
               38.951198089,38.951901959,38.94884403,38.948941748,38.949353979,38.949035993,38.949016785,38.94887402,38.948802413,38.948722997,38.94868013,38.948698153,38.948609493,38.948407937,38.948413538,38.94884251,
               38.948821237,38.948818421,38.948795076,38.949678178,38.949281509,38.949751466,38.949261269,38.949715525,38.949652229,38.949566304,38.949532396,38.949542936,38.949567821,38.94953658,38.949563742,38.948735942,
               38.952147575,38.952155751,38.951912912,38.951985954,38.952728799,38.952622921,38.952451597,38.952436249,38.95231594,38.952313127,38.951745893,38.952390373,38.952286187,38.952708734,38.951839413,38.952030386,
               38.951616852,38.951420298,38.951608998,38.952554863,38.9520134,38.951292914,38.951667791,38.952112184,38.954031241,38.953799626,38.953837241,38.953853864,38.953692287,38.953686947,38.953751245,38.953616457,
               38.95369262,38.953694331,38.953744736,38.953742862,38.953858308,38.953767308,38.953659111,38.953499777,38.953494864,38.953676808,38.953570088,38.953574927,38.953146008,38.953138966,38.953219752,38.953218684,
               38.953196026,38.953217491,38.953260642,38.953365184,38.953343071,38.953392347,38.95584336,38.955799692,38.956182326,38.95621302,38.956049617,38.957470088,38.957171152,38.956453402,38.956649954,38.956791692,
               38.957180989,38.957521592,38.955754158,38.95553646,38.955953035,38.956405511,38.956660878,38.957086511,38.957423389,38.957793854,38.957835976,38.955448024,38.955021013,38.954934154,38.954927544,38.954598007,
               38.954570833,38.954367294,38.954343,38.954497793,38.954471,38.954821256,38.954369125,38.955348715,38.955333171,38.955343991,38.955489753,38.955493927,38.955516735,38.955049181,38.955110383,38.954724398,38.954521524,
               38.954517463,38.954512208,38.954493542,38.954434212,38.954117479,38.95435162,38.954310712,38.954277052,38.954161078,38.954580606,38.954197375,38.955451505,38.955596079,38.955045523,38.955097295,38.955970146,
               38.954232335,38.95411988,38.953505553,38.955288869,38.955759644,38.955647996,38.955040953,38.954949777,38.95485026,38.954643337,38.954546745,38.953547289,38.953542137,38.953995634,38.954146947,38.954862356,
               38.953287566,38.954523419,38.954915863,38.955002144,38.954945777,38.955006524,38.95507815,38.955120243,38.953067979,38.953073084,38.953453648,38.953640022,38.953641026,38.954062633,38.954027667,38.954110137,
               38.954249401,38.953874232,38.953529725,38.953628972,38.953476826,38.95351151,38.953498365,38.953491846,38.953767787,38.953843351,38.953849161]

#Must incorporate these identifiers and cluster by similarity of species in addition to their proximity.
df["Category"]=['Aristolochia macrophylla', 'Aristolochia macrophylla', 'Aristolochia macrophylla', 'Aristolochia macrophylla', 'Aristolochia macrophylla', 'Aristolochia macrophylla', 'Aristolochia macrophylla',
                'Aristolochia macrophylla', 'Aristolochia macrophylla', 'Aristolochia macrophylla', 'Aristolochia macrophylla', 'Aristolochia macrophylla', 'Aristolochia macrophylla', 'Aristolochia macrophylla',
                'Aristolochia macrophylla', 'Aristolochia macrophylla', 'Aristolochia macrophylla', 'Aristolochia macrophylla', 'Aristolochia durior', 'Aristolochia durior', 'Aristolochia durior', 'Aristolochia durior',
                'Aristolochia durior', 'Aristolochia durior', 'Aristolochia durior', 'Aristolochia durior', 'Aristolochia durior', 'Aristolochia durior', 'Aristolochia durior', 'Aristolochia durior', 'Aristolochia durior',
                'Aristolochia durior', 'Aristolochia durior', 'Aristolochia durior', 'Aristolochia tomentosa', 'Aristolochia tomentosa', 'Aristolochia tomentosa', 'Aristolochia tomentosa', 'Aristolochia tomentosa',
                'Aristolochia tomentosa', 'Aristolochia tomentosa', 'Aristolochia tomentosa', 'Aristolochia tomentosa', 'Aristolochia tomentosa', 'Aristolochia tomentosa', 'Aristolochia tomentosa', 'Aristolochia tomentosa',
                'Aristolochia tomentosa', 'Aristolochia tomentosa', 'Aristolochia tomentosa', 'Buddleia davidii', 'Buddleia davidii', 'Buddleia davidii', 'Buddleia davidii', 'Buddleia davidii', 'Buddleia davidii',
                'Buddleia davidii', 'Buddleia davidii', 'Buddleia davidii', 'Buddleia davidii', 'Buddleia davidii', 'Buddleia davidii', 'Buddleia davidii', 'Buddleia davidii', 'Buddleia davidii', 'Buddleia davidii',
                'Buddleia x weyeriana', 'Buddleia x weyeriana', 'Buddleia x weyeriana', 'Buddleia x weyeriana', 'Buddleia x weyeriana', 'Buddleia x weyeriana', 'Buddleia x weyeriana', 'Buddleia x weyeriana', 'Buddleia x weyeriana',
                'Buddleia x weyeriana', 'Buddleia x weyeriana', 'Buddleia x weyeriana', 'Chamaecyparis obtusa', 'Chamaecyparis obtusa', 'Chamaecyparis obtusa', 'Chamaecyparis obtusa', 'Chamaecyparis obtusa', 'Chamaecyparis obtusa',
                'Chamaecyparis obtusa', 'Chamaecyparis obtusa', 'Chamaecyparis obtusa', 'Chamaecyparis obtusa', 'Chamaecyparis obtusa', 'Chamaecyparis obtusa', 'Chamaecyparis obtusa', 'Chamaecyfoccia gracilis',
                'Chamaecyfoccia gracilis', 'Chamaecyfoccia gracilis', 'Chamaecyfoccia gracilis', 'Chamaecyfoccia gracilis', 'Chamaecyfoccia gracilis', 'Chamaecyfoccia gracilis', 'Chamaecyfoccia gracilis', 'Chamaecyparis pisifera',
                'Chamaecyparis pisifera', 'Chamaecyparis pisifera', 'Chamaecyparis pisifera', 'Chamaecyparis pisifera', 'Chamaecyparis pisifera', 'Chamaecyparis pisifera', 'Chamaecyparis pisifera', 'Chamaecyparis pisifera',
                'Chamaecyparis pisifera', 'Chamaecyparis pisifera', 'Chamaecyparis pisifera', 'Cornus alba', 'Cornus alba', 'Cornus alba', 'Cornus alba', 'Cornus alba', 'Cornus alba', 'Cornus alba', 'Cornus alba', 'Cornus alba',
                'Cornus alba', 'Cornus alba', 'Cornus alba', 'Cornus alba', 'Cornus alba', 'Cornus alba', 'Cornus alba', 'Cornus alba', 'Cornus albernifolia', 'Cornus albernifolia', 'Cornus albernifolia', 'Cornus albernifolia',
                'Cornus albernifolia', 'Cornus albernifolia', 'Cornus albernifolia', 'Cornus albernifolia', 'Cornus albernifolia', 'Cornus albernifolia', 'Cornus albernifolia', 'Cornus albernifolia', 'Cornus albernifolia',
                'Cornus albernifolia', 'Cornus albernifolia', 'Cornus albernifolia', 'Cornus albernifolia', 'Cornus albernifolia', 'Cornus albernifolia', 'Cornus albernifolia', 'Cornus albernifolia', 'Cornus albernifolia',
                'Cornus canadensis', 'Cornus canadensis', 'Cornus canadensis', 'Cornus canadensis', 'Cornus canadensis', 'Cornus canadensis', 'Cornus canadensis', 'Cornus canadensis', 'Cornus canadensis', 'Cornus canadensis',
                'Cornus canadensis', 'Cornus canadensis', 'Cornus canadensis', 'Euonymus alata', 'Euonymus alata', 'Euonymus alata', 'Euonymus alata', 'Euonymus alata', 'Euonymus alata', 'Euonymus alata', 'Euonymus alata',
                'Euonymus alata', 'Euonymus alata', 'Euonymus alata', 'Euonymus alata', 'Euonymus alata', 'Euphorbia pulcherrima', 'Euphorbia pulcherrima', 'Euphorbia pulcherrima', 'Euphorbia pulcherrima', 'Euphorbia pulcherrima',
                'Euphorbia pulcherrima', 'Euphorbia pulcherrima', 'Euphorbia pulcherrima', 'Euphorbia pulcherrima', 'Euphorbia pulcherrima', 'Euphorbia pulcherrima', 'Euphorbia pulcherrima', 'Euphorbia pulcherrima',
                'Euphorbia pulcherrima', 'Euphorbia pulcherrima', 'Euphorbia pulcherrima', 'Euphorbia pulcherrima', 'Galanthus nivalis', 'Galanthus nivalis', 'Galanthus nivalis', 'Galanthus nivalis', 'Galanthus nivalis',
                'Galanthus nivalis', 'Galanthus nivalis', 'Galanthus nivalis', 'Galanthus nivalis', 'Galanthus nivalis', 'Galanthus nivalis', 'Galanthus nivalis', 'Galanthus nivalis', 'Galanthus nivalisodoratum',
                'Galanthus nivalisodoratum', 'Galanthus nivalisodoratum', 'Galanthus nivalisodoratum', 'Galanthus nivalisodoratum', 'Galanthus nivalisodoratum', 'Galanthus nivalisodoratum', 'Galanthus nivalisodoratum',
                'Galanthus nivalisodoratum', 'Galanthus nivalisodoratum', 'Galanthus nivalisodoratum', 'Hakonechloa macra', 'Hakonechloa macra', 'Hakonechloa macra', 'Hakonechloa macra', 'Hakonechloa macra', 'Hakonechloa macra',
                'Hakonechloa macra', 'Hakonechloa macra', 'Hakonechloa macra', 'Hakonechloa macra', 'Hakonechloa macra', 'Hakonechloa macra', 'Hakonechloa macra', 'Hakonechloa macra', 'Hakonechloa macra', 'Hakonechloa macra',
                'Hakonechloa macra', 'Hakonechloa macra', 'Hakonechloa macra', 'Hakonechloa aureola-macra', 'Hakonechloa aureola-macra', 'Hakonechloa aureola-macra', 'Hakonechloa aureola-macra', 'Hakonechloa aureola-macra',
                'Hakonechloa aureola-macra', 'Hakonechloa aureola-macra', 'Hakonechloa aureola-macra', 'Hakonechloa aureola-macra', 'Hakonechloa aureola-macra', 'Hakonechloa aureola-macra', 'Ilex crenata Hetzii',
                'Ilex crenata Hetzii', 'Ilex crenata Hetzii', 'Ilex crenata Hetzii', 'Ilex crenata Hetzii', 'Ilex crenata Hetzii', 'Ilex crenata Hetzii', 'Ilex crenata Hetzii', 'Ilex crenata Hetzii', 'Ilex crenata Hetzii',
                'Ilex crenata Hetzii', 'Ilex crenata Hetzii', 'Iberis sempervirens', 'Iberis sempervirens', 'Iberis sempervirens', 'Iberis sempervirens', 'Iberis sempervirens', 'Iberis sempervirens', 'Iberis sempervirens',
                'Iberis sempervirens', 'Iberis sempervirens', 'Lamium maculatum', 'Lamium maculatum', 'Lamium maculatum', 'Lamium maculatum', 'Lamium maculatum', 'Lamium maculatum', 'Lamium maculatum', 'Lamium maculatum',
                'Lamium maculatum', 'Lamium maculatum', 'Lamium maculatum', 'Lamium maculatum', 'Mertensia virginica', 'Mertensia virginica', 'Mertensia virginica', 'Mertensia virginica', 'Mertensia virginica', 'Mertensia virginica',
                'Mertensia virginica', 'Mertensia virginica', 'Aristolochata pseudophilus', 'Aristolochata pseudophilus', 'Aristolochata pseudophilus', 'Aristolochata pseudophilus', 'Aristolochata pseudophilus',
                'Aristolochata pseudophilus', 'Aristolochata pseudophilus', 'Aristolochata pseudophilus', 'Aristolochata pseudophilus', 'Aristolochata pseudophilus', 'Chamaecyparis duplicatus', 'Chamaecyparis duplicatus',
                'Chamaecyparis duplicatus', 'Chamaecyparis duplicatus', 'Chamaecyparis duplicatus', 'Chamaecyparis duplicatus', 'Chamaecyparis duplicatus', 'Chamaecyparis duplicatus', 'Chamaecyparis crenata Hetzii',
                'Chamaecyparis crenata Hetzii', 'Chamaecyparis crenata Hetzii', 'Chamaecyparis crenata Hetzii', 'Chamaecyparis crenata Hetzii', 'Chamaecyparis crenata Hetzii', 'Chamaecyparis crenata Hetzii', 'Chamaecyparis',
                'Chamaecyparis', 'Chamaecyparis', 'Chamaecyparis', 'Veronicastrum virginicum', 'Veronicastrum virginicum', 'Veronicastrum virginicum', 'Veronicastrum virginicum', 'Veronicastrum virginicum', 'Veronicastrum virginicum',
                'Veronicastrum virginicum', 'Veronicastrum virginicum', 'Veronicastrum virginicum', 'Veronicastrum virginicum', 'Veronicastrum virginicum', 'Veronicastrum virginicum', 'Veronicastrum virginicum',
                'Veronicastrum vulgaris', 'Veronicastrum vulgaris', 'Veronicastrum vulgaris', 'Veronicastrum vulgaris', 'Veronicastrum vulgaris', 'Veronicastrum vulgaris', 'Veronicastrum vulgaris', 'Veronicastrum vulgaris',
                'Veronicastrum vulgaris', 'Veronicastrum pulchra', 'Veronicastrum pulchra', 'Veronicastrum pulchra', 'Veronicastrum pulchra', 'Veronicastrum pulchra', 'Veronicastrum pulchra', 'Veronicastrum pulchra',
                'Veronicastrum pulchra', 'Veronicastrum pulchra', 'Veronicastrum pulchra']



#Get clusters with MeanShift
X = df[["POINT_X", "POINT_Y"]].to_numpy()  # Only using numeric values for now
bandwidth = estimate_bandwidth(X, quantile=0.0595, n_samples=15000)
ms = MeanShift(bandwidth=bandwidth, bin_seeding=True)
ms.fit(X)
labels = ms.labels_
cluster_centers = ms.cluster_centers_
labels_unique = np.unique(labels)
n_clusters_ = len(labels_unique)
print("Estimated number of clusters: %d" % n_clusters_)

#Make plot
plt.figure(1)
plt.clf()
colors = cycle('bgrcmyk')  # cycle() already repeats the sequence indefinitely

for k, col in zip(range(n_clusters_), colors):
    my_members = labels == k
    cluster_center = cluster_centers[k]
    plt.plot(X[my_members, 0], X[my_members, 1], col + '.')
    plt.plot(cluster_center[0], cluster_center[1], 'o', markerfacecolor=col,
             markeredgecolor='k', markersize=14)
plt.title('Clusters found by X/Y proximity (before using categorical values): %d' % n_clusters_)
plt.show()
Kdog
  • You can divide your category column into two or more columns (for genus, class, species etc), and then do the one hot encoding. – Vivek Kumar Jul 04 '18 at 06:22
  • Why not. You will have the same one-hot-encoding when genus for both is same:- `'Aristolochia'` – Vivek Kumar Jul 04 '18 at 09:54
  • I see what you mean but I really need to find a solution based on the similarity of the strings in that column. – Kdog Jul 04 '18 at 09:55
  • In zoology you can have the same words repeated in the genus and species and sometimes even the family. This would cluster unrelated species. But I think if the data didn't have this quirk, your solution would work. Thank you! – Kdog Jul 04 '18 at 10:06
  • Can you give an example of what you said above? Because to my knowledge the binomial names should be unique within a kingdom. And even across multiple kingdoms, these duplicates should be rare. – Vivek Kumar Jul 04 '18 at 10:19
  • They're called tautonyms: https://en.wikipedia.org/wiki/List_of_tautonyms. They are rare but I'm using them as an example while looking for a solution that can be applied to data with other meaning where string similarity is a factor. – Kdog Jul 04 '18 at 10:31
  • Oh, you are talking about these repeating words. They will not be a problem because each column will be one-hot encoded separately. So generic names and specific names will be independently encoded even if they are the same. And the model will treat them as such. – Vivek Kumar Jul 04 '18 at 10:35
  • Definitely. The tautonyms are just an illustration though. Per your advice I'm going to try to separate the individual words in the categorical value column and one-hot encode them. I was hoping to work in something like the Levenshtein distance metric, but I can't think of how to add it as an input column. Here is what that solution may look like: https://stackoverflow.com/questions/38720283/python-string-clustering-with-scikit-learns-dbscan-using-levenshtein-distance – Kdog Jul 04 '18 at 11:08
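
For reference, the split-and-encode suggestion from the comments could be sketched roughly as follows. This is an illustration under assumptions, not a tested solution for the full data set: the toy rows, the `genus_weight` factor and the bandwidth are made up, and only the genus (the first word of each name) is one-hot encoded here. The remaining words could be encoded the same way.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import MeanShift
from sklearn.preprocessing import StandardScaler

# Toy rows (made up): two locations, each holding one plant of each genus.
toy = pd.DataFrame({
    "POINT_X": [0.0, 1.0, 0.0, 1.0],
    "POINT_Y": [0.0, 1.0, 0.0, 1.0],
    "Category": ["Aristolochia macrophylla", "Aristolochia durior",
                 "Cornus alba", "Cornus canadensis"],
})

# The genus is the text before the first space; names without a space
# (such as a bare 'Chamaecyparis') are kept whole.
toy["genus"] = toy["Category"].str.partition(" ")[0]

# One 0/1 column per genus, so congeners share an identical encoding.
genus_dummies = pd.get_dummies(toy["genus"], prefix="genus", dtype=float)

# Scale the coordinates, then weight the genus columns against them.
xy_scaled = StandardScaler().fit_transform(toy[["POINT_X", "POINT_Y"]].to_numpy())
genus_weight = 5.0  # assumed value; larger means genus matters more than location
features = np.hstack([xy_scaled, genus_weight * genus_dummies.to_numpy()])

labels = MeanShift(bandwidth=4.0).fit(features).labels_
print(labels)  # the two Aristolochia rows share a label, the two Cornus rows share another
```

With these toy values the genus columns dominate, so plants at the same spot are split by genus. On the real data both `genus_weight` and the bandwidth would need tuning.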
