Clustering by groups

Question

How can I perform clustering by groups? For example, take this Pokemon dataset on Kaggle.

A sample of this dataset looks like this (changed some fields to mimic my data):

Name                        Type I  Type II
Bulbasaur                   Grass   Poison  
Bulbasaur 2                 Grass   Poison  
Venusaur                    Grass   Not Null
VenusaurMega Venusaur       Grass   Not Null
...
Charizard                   Fire    Flying
CharizardMega Charizard X   Fire    Dragon

Supposing there are no nulls in my dataset, how can I group by the Type I and Type II columns respectively, and then cluster by similarity between names?

The output should be like so:

Name                        Type I  Type II  Cluster
Bulbasaur                   Grass   Poison   1
Bulbasaur 2                 Grass   Poison   1
Venusaur                    Grass   Not Null 2
VenusaurMega Venusaur       Grass   Not Null 2
...
Charizard                   Fire    Flying   3
CharizardMega Charizard X   Fire    Dragon   4

I tried a method similar to the one shown here, but it doesn't work with the NbClust function I am using:

clust <- NbClust(data, diss = string_dist, distance = NULL, min.nc = 2, max.nc = 125, method = "ward.D2", index = "ch")

See in the dupe target. I think this is what you are looking for. `rleid` will cluster nonconsecutive appearances of the same value into separate groups. — David Arenburg, May 25 '17 at 10:19
No clustering here, only group-by. What is the statistical optimization for clustering? — Has QUIT--Anony-Mousse, Jun 02 '17 at 21:45

Tonio Liebrand · Accepted Answer · 2017-05-25T22:52:45.893


You can use `.GRP` from the data.table package. (The original version of this answer used `rleid`; see the edit below.)

library(data.table)

df <- fread("
#,Name,Type 1,Type 2,Total,HP,Attack,Defense,Sp. Atk,Sp. Def,Speed,Generation,Legendary
1,Bulbasaur,Grass,Poison,318,45,49,49,65,65,45,1,False
      2,Ivysaur,Grass,Poison,405,60,62,63,80,80,60,1,False
      3,Venusaur,Grass,Poison,525,80,82,83,100,100,80,1,False
      3,VenusaurMega Venusaur,Grass,Poison,625,80,100,123,122,120,80,1,False
      4,Charmander,Fire,,309,39,52,43,60,50,65,1,False
      5,Charmeleon,Fire,,405,58,64,58,80,65,80,1,False
      ")

Edit: (see the comments)

setDT(df, key=c("Type 1","Type 2"))[, Cluster:=.GRP, by = key(df)][]
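To illustrate why the edit switched from `rleid` to `.GRP`, here is a minimal sketch on made-up toy data (the table `toy` and its column names are assumptions for illustration, not part of the Kaggle dataset). It shows that `rleid` numbers runs of consecutive values, while `.GRP` numbers the groups formed by `by`.

```r
library(data.table)

# Toy data, invented for this sketch: "Poison" reappears after a gap.
toy <- data.table(type2 = c("Poison", "Poison", "Flying", "Poison"))

# rleid() numbers runs of consecutive equal values, so the last "Poison"
# starts a new run and receives a new id.
toy[, run_id := rleid(type2)]

# .GRP numbers the groups defined by `by`, so every "Poison" row shares one id.
toy[, grp_id := .GRP, by = type2]

print(toy)
```

If the data were sorted by the grouping columns first, both approaches would give the same ids; the difference only shows up when a value reappears after other values.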

edited May 25 '17 at 22:52

answered May 23 '17 at 08:59



`library(data.table)` – Sotos May 23 '17 at 09:08
I accepted this because it can "cluster" by group, but I think I should have clarified this portion of the question: "Supposing there are no nulls in my dataset, how can I group by the Type I and Type II columns respectively, and then cluster by similarity between names?" To be clear, I wanted to cluster the Pokemon names after grouping by Type. – cocanut May 25 '17 at 00:38
``df[, clu := rleid(`Type 2`)]`` is more data.table-ish (your object `df` is a data.table) – jogo May 25 '17 at 08:48
https://stackoverflow.com/questions/32760524/combining-all-immediately-previous-rows-that-have-the-same-value-as-last-row-in – jogo May 25 '17 at 09:03
This answer looks wrong to me. What happens if `Poison` appears again in the data? It will be classified as a new group. My guess they are looking for `.GRP`. – David Arenburg May 25 '17 at 10:13
I took a look at the full data. You are right, it does indeed appear again for some types. My bad. I made an edit. Thanks for the remark. – Tonio Liebrand May 25 '17 at 22:56
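For the clarified question (group by type first, then cluster the names inside each group), one possible approach is sketched below in base R. It is only a sketch under stated assumptions: the helper `cluster_names`, the simplified column names `Type1`/`Type2`, the use of normalised Levenshtein distance via `adist`, average-linkage `hclust`, and the cut-off `height = 0.7` are all choices made for illustration, not something taken from the question or from NbClust.

```r
# Sample rows from the question (column names simplified to valid R names).
pokemon <- data.frame(
  Name  = c("Bulbasaur", "Bulbasaur 2", "Venusaur", "VenusaurMega Venusaur",
            "Charizard", "CharizardMega Charizard X"),
  Type1 = c("Grass", "Grass", "Grass", "Grass", "Fire", "Fire"),
  Type2 = c("Poison", "Poison", "Not Null", "Not Null", "Flying", "Dragon"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: cluster a character vector by edit distance.
# `height` is an arbitrary cut-off assumed for this sketch.
cluster_names <- function(x, height = 0.7) {
  # hclust() needs at least two objects; smaller groups form one cluster.
  if (length(x) < 2) return(rep(1L, length(x)))
  d <- adist(x)                             # Levenshtein distances
  d <- d / outer(nchar(x), nchar(x), pmax)  # scale by the longer name
  tree <- hclust(as.dist(d), method = "average")
  unname(cutree(tree, h = height))
}

# Cluster names separately inside each Type1/Type2 group.
pokemon$NameCluster <- ave(
  seq_len(nrow(pokemon)), pokemon$Type1, pokemon$Type2,
  FUN = function(i) cluster_names(pokemon$Name[i])
)

# Combine group and within-group cluster into one running cluster id
# (assumes "|" does not occur in the type names).
key <- paste(pokemon$Type1, pokemon$Type2, pokemon$NameCluster, sep = "|")
pokemon$Cluster <- match(key, unique(key))

print(pokemon)
```

The cut-off decides how aggressively names are merged, so it would need tuning on the real data; a different string distance or a cluster-count index could be swapped in at the `cluster_names` step.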

score 0 · Answer 2 · answered May 23 '17 at 09:05

We can use base R

df$cluster <- with(df, match(`Type II`, unique(`Type II`)))
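The line above keys on `Type II` only. If the ids should reflect both type columns, a minimal base R sketch is to match on a combined key. The small `df` here is made up for illustration, and the separator `"|"` is assumed not to occur in the type names.

```r
# Made-up sample with the question's column names.
df <- data.frame(
  Name      = c("Bulbasaur", "Venusaur", "Charizard", "Ivysaur"),
  `Type I`  = c("Grass", "Grass", "Fire", "Grass"),
  `Type II` = c("Poison", "Not Null", "Flying", "Poison"),
  check.names = FALSE, stringsAsFactors = FALSE
)

# Build one key per row from both columns, then number the keys
# in order of first appearance.
key <- paste(df$`Type I`, df$`Type II`, sep = "|")
df$cluster <- match(key, unique(key))

print(df)
```

Like the `.GRP` approach, this gives the same id to every row with the same type pair, even when the rows are not adjacent.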

answered May 23 '17 at 09:05

akrun

