
I've finished my clustering and now want to use it to replace missing values. My idea is to compute a representative for each cluster and then replace missing values according to that representative. The problem is that I don't really know how to do that.

I searched and found this question, which seems to almost answer my problem (finding a representative would also work for me), but I don't understand it well enough to use it.

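library(data.table)
library(dplyr)
library(tidyr)
library(TSclust)
set.seed(1)
# Toy data: 5 time points, 6 series (one column per series)
df = data.table(
  "Time" = c(1,2,3,4,5),
  "1" = runif(5),
  "2" = runif(5),
  "3" = runif(5),
  "4" = runif(5),
  "5" = runif(5),
  "6" = runif(5))

# Hierarchical clustering on Euclidean dissimilarities, cut into 3 clusters
clusters = hclust(diss(ts(df[,-1]), "EUCL"))
tree = cutree(clusters, 3)

# Naive representative: mean of each cluster's members at every time point.
# Named `representatives` so the result does not shadow base::rep().
representatives = df%>%
  gather(key = ID,value = Conso, -Time)%>%
  mutate(Cluster = rep(unname(tree), each = 5))%>%
  group_by(Cluster, Time)%>%
  summarise(Conso = mean(Conso))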

The code above builds something close to my actual data, along with a naive way to compute representatives.

Is this actually an OK way to do it? Do you know a way to extract those representatives from the clusters?
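
To make the fill-in step concrete, here is a minimal base R sketch of what I have in mind. It is not the TSclust workflow above: it assumes base `dist()` as a stand-in for `diss(..., "EUCL")` (`dist()` tolerates `NA` by dropping the affected time points and rescaling), and the toy data, the positions of the gaps, and the names `m_na`, `centroids` and `filled` are all made up for illustration.

```r
set.seed(1)
# Hypothetical toy data: 3 well-separated pairs of series, 5 time points each
base <- cbind(rep(0, 5), rep(10, 5), rep(20, 5))
m <- base[, rep(1:3, each = 2)] + matrix(runif(30), nrow = 5)
colnames(m) <- as.character(1:6)

# Hypothetical gaps, for illustration only
m_na <- m
m_na[2, "1"] <- NA
m_na[4, "5"] <- NA

# Cluster the series (columns); base dist() is an assumed stand-in for diss()
tree <- cutree(hclust(dist(t(m_na))), k = 3)

# Representative: per-cluster mean at each time point, ignoring missing values
# (rows = time points, columns = clusters 1..3)
centroids <- sapply(sort(unique(tree)), function(k) {
  rowMeans(m_na[, tree == k, drop = FALSE], na.rm = TRUE)
})

# Replace each NA with its cluster's representative at the same time point
filled <- m_na
for (j in seq_len(ncol(filled))) {
  gaps <- is.na(filled[, j])
  filled[gaps, j] <- centroids[gaps, tree[j]]
}
```

A cluster whose members are all missing at the same time point would give `NaN` here, so that case would need separate handling.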

FBiggio
  • When it comes to time series, many researchers argue that a mean centroid is a poor representative, but it is definitely a possibility, and you're calculating it correctly. You could read Section 3 in [this vignette](https://cran.r-project.org/web/packages/dtwclust/vignettes/dtwclust.pdf) if you want more information. – Alexis Aug 26 '19 at 17:03
  • In general, it is worth going over this: https://cran.r-project.org/web/views/MissingData.html – Tal Galili Aug 28 '19 at 07:59
  • If you have missing values, how good will your clustering be? It's distance-based, and the distances suffer from missing values. So you'd want to first solve the missing values... – Has QUIT--Anony-Mousse Sep 01 '19 at 06:46
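
Following up on the first comment: one common alternative to a mean centroid is a medoid, i.e. the actual member series with the smallest total distance to the rest of its cluster. The sketch below is only an illustration in base R, again assuming `dist()` in place of `diss(..., "EUCL")`; the helper `medoid_of` is a hypothetical name, not part of TSclust or dtwclust.

```r
set.seed(1)
# Toy data shaped like the question's: 5 time points (rows) x 6 series (columns)
m <- matrix(runif(30), nrow = 5, dimnames = list(NULL, as.character(1:6)))

# Pairwise Euclidean distances between series, kept as a full matrix for lookups
d <- as.matrix(dist(t(m)))
tree <- cutree(hclust(as.dist(d)), k = 3)

# Hypothetical helper: ID of the member with the smallest summed distance
# to the other members of cluster k
medoid_of <- function(k) {
  members <- names(tree)[tree == k]
  totals <- rowSums(d[members, members, drop = FALSE])
  members[which.min(totals)]
}

medoids <- vapply(sort(unique(tree)), medoid_of, character(1))

# One representative series per cluster, taken directly from the data
medoid_series <- m[, medoids, drop = FALSE]
```

Unlike the mean, a medoid is always one of the observed series, so it cannot be an averaged shape that no member actually has.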
