Labelling clustered data

Question

I hope somebody can help me. I've been trying to work out how my data is clustered, and for that I've been using k-means and the elbow method in R, as suggested on the R-bloggers blog.

Here's a sample of what my data looks like (datanet). I've based my cluster analysis on the first three columns, ACTIVITY_X, ACTIVITY_Y and ACTIVITY_Z:

   ACTIVITY_X ACTIVITY_Y ACTIVITY_Z   Event
1:         40         47         62 Vigilance
2:         60         74         95 Head-up
3:         62         63         88 Head-up
4:         60         56         82 Head-up
5:         66         61         90 Head-up
6:         60         53         80 Head-up

I've obtained a between_SS/total_SS score of 89.0 % for k=4, so I'm pretty confident that my data is grouped into 4 clusters.
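
For readers who want to reproduce the elbow check, here is a minimal sketch in base R. It runs on made-up data (the `toy` matrix and its `centres` are assumptions, not the asker's `datanet`) and prints the total within-cluster sum of squares and the between_SS/total_SS ratio for a range of k. Note that the ratio keeps rising as k grows, so a high value at k = 4 is only convincing if the curve also flattens after that point.

```r
# Toy stand-in for the three activity columns (NOT the asker's data):
# four well-separated groups of 50 points each, in 3 dimensions.
set.seed(42)
centres <- rbind(c(0, 0, 0), c(50, 50, 70), c(60, 130, 150), c(100, 40, 100))
toy <- do.call(rbind, lapply(1:4, function(i) {
  # each = 50 makes the first 50 draws use the x mean, the next 50 the y mean, ...
  matrix(rnorm(50 * 3, mean = rep(centres[i, ], each = 50), sd = 5), ncol = 3)
}))

# Fit k-means for k = 1..8; nstart = 25 reruns from several random starts.
ks <- 1:8
fits <- lapply(ks, function(k) kmeans(toy, centers = k, nstart = 25))

# Quantities used by the elbow method.
wss <- sapply(fits, function(f) f$tot.withinss)
ratio <- sapply(fits, function(f) f$betweenss / f$totss)

elbow <- data.frame(k = ks,
                    tot_withinss = round(wss, 1),
                    between_over_total = round(ratio, 3))
print(elbow)
```

Plotting `elbow$tot_withinss` against `elbow$k` gives the usual elbow curve; the k where the drop levels off is the candidate number of clusters.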

Now I'd like to know whether these 4 clusters correspond to the different labels in the Event column of the data sample above.

I've used tsclust() with fuzzy clustering to see how my data clusters. Here's my code:

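library(data.table)  # datanet is a data.table
library(dtwclust)

# Drop the label column; keep the three numeric activity columns
trainset1 <- datanet[, !"Event"]
# byrow and ncol are arguments of matrix(), not as.matrix(), so they are omitted
train <- as.matrix(trainset1)
head(train)
# Fuzzy clustering into 4 clusters; each matrix row is treated as one series
train_clust <- tsclust(train, k = 4L, type = "fuzzy")
plot(train_clust@cluster)  # cluster index of each data point
plot(train_clust)          # centroids of the clusters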

The last command, plot(train_clust), shows the respective centroids of the different clusters:

plot(train_clust@cluster) shows which cluster each datapoint falls into:

However, is there a way to know which label out of the Event column each datapoint represents? As stated earlier, I entered my data for tsclust() as a train matrix only including the first three columns of my data as shown above (since they're the ones with the values).

How can I bring in the fourth column, "Event", so that each datapoint has an associated label (Head-up, Vigilance etc.)?

My aim would be to end up with a cluster plot that resembles something like this:

Hope this question was interesting as I'm fairly new to R. Any input is appreciated!

P.S. As requested in the comments:

> dput(datanet)
structure(list(ACTIVITY_X = c(40L, 60L, 62L, 60L, 66L, 60L, 57L, 
54L, 52L, 93L, 80L, 14L, 52L, 61L, 51L, 40L, 20L, 21L, 5L, 53L, 
48L, 73L, 73L, 21L, 29L, 63L, 59L, 57L, 51L, 53L, 67L, 72L, 74L, 
70L, 60L, 74L, 85L, 77L, 68L, 58L, 80L, 34L, 45L, 34L, 60L, 75L, 
62L, 66L, 51L, 53L, 48L, 62L, 62L, 57L, 5L, 1L, 12L, 23L, 5L, 
4L, 0L, 13L, 45L, 44L, 31L, 68L, 88L, 43L, 70L, 18L, 83L, 71L, 
67L, 75L, 74L, 49L, 90L, 44L, 64L, 57L, 22L, 29L, 52L, 37L, 32L, 
120L, 45L, 22L, 54L, 30L, 9L, 27L, 14L, 3L, 29L, 12L, 10L, 61L, 
60L, 29L, 15L, 7L, 6L, 0L, 2L, 0L, 4L, 1L, 7L, 0L, 0L, 0L, 0L, 
0L, 0L, 0L, 0L, 0L, 1L, 15L, 23L, 49L, 46L, 8L, 31L, 45L, 60L, 
31L, 37L, 61L, 52L, 51L, 38L, 86L, 60L, 41L, 43L, 40L, 42L, 42L, 
48L, 64L, 71L, 59L, 0L, 11L, 27L, 12L, 3L, 0L, 0L, 8L, 0L, 21L, 
6L, 2L, 7L, 4L, 3L, 3L, 46L, 46L, 59L, 53L, 37L, 44L, 39L, 49L, 
37L, 47L, 17L, 36L, 32L, 33L, 26L, 12L, 8L, 25L, 31L, 35L, 27L, 
27L, 24L, 17L, 35L, 39L, 28L, 54L, 5L, 0L, 0L, 0L, 0L, 0L, 17L, 
22L, 25L, 12L, 0L, 5L, 41L, 51L, 66L, 39L, 32L, 53L, 43L, 40L, 
44L, 45L, 48L, 51L, 41L, 45L, 39L, 46L, 59L, 31L, 5L, 24L, 18L, 
5L, 15L, 13L, 0L, 12L, 26L, 0L), ACTIVITY_Y = c(47L, 74L, 63L, 
56L, 61L, 53L, 40L, 41L, 49L, 32L, 54L, 13L, 39L, 99L, 130L, 
38L, 14L, 6L, 5L, 94L, 96L, 38L, 43L, 29L, 47L, 66L, 47L, 38L, 
31L, 36L, 35L, 38L, 72L, 54L, 44L, 45L, 51L, 80L, 48L, 39L, 85L, 
42L, 39L, 37L, 75L, 36L, 45L, 32L, 35L, 41L, 26L, 99L, 163L, 
124L, 0L, 0L, 24L, 37L, 0L, 6L, 0L, 29L, 29L, 26L, 27L, 54L, 
147L, 82L, 98L, 12L, 83L, 97L, 104L, 128L, 81L, 42L, 102L, 60L, 
79L, 58L, 15L, 14L, 75L, 75L, 40L, 130L, 40L, 13L, 54L, 42L, 
7L, 10L, 3L, 0L, 15L, 8L, 7L, 75L, 55L, 26L, 18L, 1L, 13L, 0L, 
0L, 0L, 1L, 0L, 4L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 5L, 
17L, 45L, 38L, 10L, 31L, 52L, 36L, 24L, 65L, 97L, 45L, 59L, 49L, 
92L, 51L, 34L, 21L, 20L, 29L, 28L, 22L, 32L, 30L, 86L, 0L, 4L, 
15L, 7L, 4L, 0L, 0L, 0L, 0L, 11L, 3L, 0L, 1L, 3L, 1L, 0L, 72L, 
62L, 98L, 55L, 26L, 39L, 28L, 81L, 20L, 52L, 12L, 48L, 24L, 40L, 
30L, 5L, 6L, 44L, 40L, 37L, 33L, 26L, 17L, 14L, 39L, 27L, 28L, 
67L, 0L, 0L, 0L, 0L, 0L, 0L, 10L, 12L, 14L, 7L, 0L, 2L, 39L, 
67L, 74L, 28L, 23L, 57L, 34L, 36L, 36L, 37L, 46L, 43L, 73L, 65L, 
31L, 64L, 128L, 17L, 3L, 12L, 17L, 0L, 9L, 7L, 0L, 7L, 17L, 0L
), ACTIVITY_Z = c(62L, 95L, 88L, 82L, 90L, 80L, 70L, 68L, 71L, 
98L, 97L, 19L, 65L, 116L, 140L, 55L, 24L, 22L, 7L, 108L, 107L, 
82L, 85L, 36L, 55L, 91L, 75L, 69L, 60L, 64L, 76L, 81L, 103L, 
88L, 74L, 87L, 99L, 111L, 83L, 70L, 117L, 54L, 60L, 50L, 96L, 
83L, 77L, 73L, 62L, 67L, 55L, 117L, 174L, 136L, 5L, 1L, 27L, 
44L, 5L, 7L, 0L, 32L, 54L, 51L, 41L, 87L, 171L, 93L, 120L, 22L, 
117L, 120L, 124L, 148L, 110L, 65L, 136L, 74L, 102L, 81L, 27L, 
32L, 91L, 84L, 51L, 177L, 60L, 26L, 76L, 52L, 11L, 29L, 14L, 
3L, 33L, 14L, 12L, 97L, 81L, 39L, 23L, 7L, 14L, 0L, 2L, 0L, 4L, 
1L, 8L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 1L, 16L, 29L, 67L, 
60L, 13L, 44L, 69L, 70L, 39L, 75L, 115L, 69L, 78L, 62L, 126L, 
79L, 53L, 48L, 45L, 51L, 50L, 53L, 72L, 77L, 104L, 0L, 12L, 31L, 
14L, 5L, 0L, 0L, 8L, 0L, 24L, 7L, 2L, 7L, 5L, 3L, 3L, 85L, 77L, 
114L, 76L, 45L, 59L, 48L, 95L, 42L, 70L, 21L, 60L, 40L, 52L, 
40L, 13L, 10L, 51L, 51L, 51L, 43L, 37L, 29L, 22L, 52L, 47L, 40L, 
86L, 5L, 0L, 0L, 0L, 0L, 0L, 20L, 25L, 29L, 14L, 0L, 5L, 57L, 
84L, 99L, 48L, 39L, 78L, 55L, 54L, 57L, 58L, 66L, 67L, 84L, 79L, 
50L, 79L, 141L, 35L, 6L, 27L, 25L, 5L, 17L, 15L, 0L, 14L, 31L, 
0L), Event = c("Head-up", "Head-up", "Head-up", "Head-up", "Head-up", 
"Head-up", "Head-up", "Head-up", "Head-up", "Moving", "Moving", 
"Head-up", "Head-up", "Head-up", "Head-up", "Head-up", "Head-up", 
"Head-up", "Moving", "Head-up", "Head-up", "Head-up", "Head-up", 
"Head-up", "Head-up", "Head-up", "Head-up", "Head-up", "Head-up", 
"Head-up", "Head-up", "Head-up", "Head-up", "Head-up", "Head-up", 
"Head-up", "Head-up", "Head-up", "Head-up", "Head-up", "Head-up", 
"Head-up", "Head-up", "Head-up", "Head-up", "Head-up", "Head-up", 
"Head-up", "Head-up", "Head-up", "Head-up", "Head-up", "Head-up", 
"Head-up", "Grazing", "Grazing", "Moving", "Moving", "Grazing", 
"Grazing", "Grazing", "Moving", "Moving", "Grazing", "Grazing", 
"Grazing", "Grazing", "Grooming", "Grooming", "Grazing", "Grazing", 
"Grooming", "Head-up", "Head-up", "Vigilance", "Grazing", "Grazing", 
"Grazing", "Grazing", "Vigilance", "Grazing", "Grazing", "Grazing", 
"Grazing", "Moving", "Grazing", "Grazing", "Grazing", "Grazing", 
"Grazing", "Moving", "Vigilance", "Vigilance", "Vigilance", "Head-up", 
"Head-up", "Head-up", "Head-up", "Grazing", "Grazing", "Grazing", 
"Grazing", "Grazing", "Grazing", "Grazing", "Grazing", "Grazing", 
"Grazing", "Grazing", "Head-up", "Head-up", "Head-up", "Head-up", 
"Head-up", "Head-up", "Head-up", "Head-up", "Head-up", "Head-up", 
"Head-up", "Grazing", "Grazing", "Grazing", "Grazing", "Grazing", 
"Grooming", "Grazing", "Head-up", "Head-up", "Head-up", "Head-up", 
"Head-up", "Head-up", "Head-up", "Head-up", "Head-up", "Head-up", 
"Head-up", "Head-up", "Head-up", "Moving", "Moving", "Vigilance", 
"Vigilance", "Grazing", "Head-up", "Head-up", "Head-up", "Head-up", 
"Head-up", "Moving", "Grazing", "Head-up", "Head-up", "Head-up", 
"Head-up", "Head-up", "Head-up", "Head-up", "Head-up", "Head-up", 
"Head-up", "Head-up", "Head-up", "Head-up", "Head-up", "Head-up", 
"Head-up", "Head-up", "Head-up", "Head-up", "Head-up", "Head-up", 
"Head-up", "Head-up", "Head-up", "Head-up", "Head-up", "Grazing", 
"Grazing", "Grazing", "Grazing", "Head-up", "Head-up", "Grazing", 
"Head-up", "Vigilance", "Head-up", "Head-up", "Head-up", "Moving", 
"Head-up", "Head-up", "Head-up", "Head-up", "Head-up", "Head-up", 
"Head-up", "Head-up", "Head-up", "Head-up", "Head-up", "Head-up", 
"Head-up", "Head-up", "Head-up", "Head-up", "Head-up", "Head-up", 
"Head-up", "Head-up", "Head-up", "Head-up", "Head-up", "Head-up", 
"Head-up", "Head-up", "Head-up", "Head-up", "Head-up", "Head-up", 
"Head-up", "Head-up", "Head-up", "Head-up", "Head-up", "Head-up", 
"Vigilance")), row.names = c(NA, -228L), class = c("data.table", 
"data.frame"), .internal.selfref = <pointer: 0x00000000051e1ef0>)

Once you have a dataset with your clusters you can join that back to your original dataset. You can then plot it with different indicators for the clusters and for the event. If you post your data using `dput(your_data)` it would be easier to show you. – Ben G, Feb 28 '19 at 14:07
There's a lot more code here than is needed for the issue, which makes it hard to pinpoint exactly what the problem is. Like @BenG I think it's actually just a joining/merging problem, but it takes some digging to see that. – camille, Feb 28 '19 at 14:17
If you want to know if your data is 4 clustered based on the Event column, maybe you should start with `table(Event, Cluster)`? – G5W, Feb 28 '19 at 14:22
@BenG I've posted the data using `dput()` as you suggested. Hope this helps! – juansalix, Feb 28 '19 at 14:27
@G5W So I should create a table with the `Event` column and `Cluster` as arguments, right? But then, where would the `Cluster` argument come from? – juansalix, Feb 28 '19 at 14:28
What's the dataset you posted? Is that `trainset1`? Your `dput` command refers to `datanet`, which is not in your code. – camille, Feb 28 '19 at 14:34
@camille Thanks for pointing that out, just fixed the issue. Everything should be ok now. – juansalix, Feb 28 '19 at 14:46
Maybe you should apply the `cutree` function to a clustering object; I also recommend the base R `hclust` function. Something like `hc <- hclust(as.dist(your_matrix))` and then `groups <- cutree(hc, k=5)`, where `k` sets the number of groups, as in k-means. If that is what you need, let me know and I'll post a complete answer. – Aureliano Guedes, Feb 28 '19 at 14:48
@G5W Thanks. I'm getting an error message though: `> table(Event,train_clust@cluster) Error in table(Event, train_clust@cluster) : object 'Event' not found` Any ideas about how to fix it? – juansalix, Feb 28 '19 at 14:49
You need `datanet$Event`, but when I wrote my first comment I did not know the name of your data.frame. – G5W, Feb 28 '19 at 14:51
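
The two suggestions in the comments above (hierarchical clustering with `cutree`, then `table()` against the labels) can be sketched together in base R. This is only an illustration on invented data: the `toy` data frame and its two-group structure are assumptions, and `dist()` is used to compute distances from the raw rows, since `as.dist()` expects something that is already a distance matrix.

```r
# Small made-up data set with the same column names as in the question.
set.seed(1)
toy <- data.frame(
  ACTIVITY_X = c(rnorm(20, 5, 2), rnorm(20, 60, 5)),
  ACTIVITY_Y = c(rnorm(20, 5, 2), rnorm(20, 55, 5)),
  ACTIVITY_Z = c(rnorm(20, 8, 2), rnorm(20, 85, 5)),
  Event = rep(c("Grazing", "Head-up"), each = 20)
)

# dist() builds the pairwise Euclidean distances that hclust() works on.
hc <- hclust(dist(toy[, c("ACTIVITY_X", "ACTIVITY_Y", "ACTIVITY_Z")]))

# Cut the tree into k groups; the result has one cluster id per row,
# in the same order as the rows of toy.
groups <- cutree(hc, k = 2)

# Cross-tabulate labels against clusters. The label vector must be
# referenced through its data frame (toy$Event), not as a bare name.
xtab <- table(Event = toy$Event, Cluster = groups)
print(xtab)
```

With the asker's objects the equivalent call would be `table(datanet$Event, train_clust@cluster)`, as G5W points out.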

score 1 · Accepted Answer · answered Feb 28 '19 at 16:07

I'm not familiar with the package you used for the clustering, so I had some fun figuring out how to extract the results from it. Here's how I did it:

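library(tidyverse)  # This is a large package with many data manipulation functions
library(dtwclust)

# Drop the label column so that only the numeric activity columns are clustered
trainset1 <- datanet %>%
  select(-Event)

# byrow and ncol are arguments of matrix(), not as.matrix(), so they are dropped
train <- as.matrix(trainset1)

train_clust <- tsclust(train, k = 4L, type = "fuzzy")

# One cluster index per row, in the same order as the rows of datanet
clusters <- tibble(cluster = c(train_clust@cluster))

# Bind the cluster indices back onto the labelled data, column-wise
combined_set <- bind_cols(datanet, clusters)

combined_set %>%
  # I find the ggplot2 package much better for graphing than the base package
  ggplot(aes(ACTIVITY_X, ACTIVITY_Y, color = as.factor(cluster), shape = Event)) +
  geom_point()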

I would highly recommend checking out the tidyverse for data manipulation and data science in general. Check out this FREE ebook: https://r4ds.had.co.nz/.
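
For anyone who prefers to stay in base R, the same idea (attach the cluster index to the labelled rows, then colour by cluster and vary the symbol by label) can be sketched without extra packages. This is an assumption-laden illustration: the `toy` data is invented, and `kmeans()` stands in for `tsclust()` so that the example runs on its own. The key point is the same, namely that the cluster vector comes back in the row order of the input.

```r
# Invented data with the question's column names (NOT the asker's datanet).
set.seed(7)
toy <- data.frame(
  ACTIVITY_X = c(rnorm(15, 5, 2), rnorm(15, 60, 5)),
  ACTIVITY_Y = c(rnorm(15, 5, 2), rnorm(15, 55, 5)),
  ACTIVITY_Z = c(rnorm(15, 8, 2), rnorm(15, 85, 5)),
  Event = sample(c("Grazing", "Head-up", "Moving"), 30, replace = TRUE),
  stringsAsFactors = FALSE
)

# kmeans() is a stand-in for tsclust() here; only the numeric columns are used.
fit <- kmeans(toy[, c("ACTIVITY_X", "ACTIVITY_Y", "ACTIVITY_Z")],
              centers = 2, nstart = 10)

# The cluster vector is in the same row order as the input, so it can be
# attached directly as a new column next to the Event labels.
toy$cluster <- fit$cluster

# Colour encodes the cluster, plotting symbol encodes the Event label.
event_f <- factor(toy$Event)
plot(toy$ACTIVITY_X, toy$ACTIVITY_Y,
     col = toy$cluster, pch = as.integer(event_f),
     xlab = "ACTIVITY_X", ylab = "ACTIVITY_Y")
legend("topleft", legend = levels(event_f),
       pch = seq_along(levels(event_f)), title = "Event")

print(head(toy))
```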

Ben G


Thank you for your answer. I was wondering now how I could create a table with the number of times each event appears in each of the four clusters. I went with `table(train_clust@cluster,datanet$Event)`. Is that the right way? – juansalix Mar 01 '19 at 12:30
I never really have the need to build a contingency table, but that sounds right. – Ben G Mar 01 '19 at 14:42
I'm sorry to bother you again. I was wondering whether your script could include (or already includes) a likelihood value, i.e. the probability that an `Event` belongs to a certain `cluster`? Please let me know! – juansalix Mar 05 '19 at 13:16
You'd have to calculate that separately--pretty different question so I would post again as a different question if you can't figure it out. – Ben G Mar 05 '19 at 14:26
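
As a rough starting point for the follow-up in these comments, the contingency table can be turned into empirical proportions with `prop.table()`. The sketch below uses two short invented vectors (`event` and `cluster` are assumptions standing in for `datanet$Event` and `train_clust@cluster`). These are observed shares, not model-based likelihoods; for fuzzy clustering, dtwclust also stores per-observation membership degrees (in the `fcluster` slot, as far as I know), which is a different quantity.

```r
# Invented labels and cluster ids, aligned by position (one pair per row).
event   <- c("Head-up", "Head-up", "Grazing", "Grazing",
             "Grazing", "Moving", "Head-up", "Moving")
cluster <- c(1, 1, 2, 2, 1, 3, 1, 3)

# Counts: how often each event label falls into each cluster.
counts <- table(Event = event, Cluster = cluster)

# margin = 1: proportions within each row, i.e. for a given event,
# the share of its observations that landed in each cluster.
share_by_event <- prop.table(counts, margin = 1)

# margin = 2: proportions within each column, i.e. for a given cluster,
# the share of its members carrying each event label.
share_by_cluster <- prop.table(counts, margin = 2)

print(counts)
print(round(share_by_event, 2))
print(round(share_by_cluster, 2))
```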
