My data are in a coordinate system (let's say they are integers for simplicity), and each point will be classified into predefined groups (that are not related to the actual location of the point). When I plot the points out, some of the points form clusters. I want to visualize these clusters.

In R, I can use ggplot and map the color of the points to the grouping variable. However, this is not ideal for my case, as the color/fill of the points is already taken by another variable of interest. Therefore, I would like to "draw a boundary" around each cluster.

I searched around and found that the chull function in R may be useful, but it does not do quite what I want. One problem is that it will always group the ...

In short, I would like to accomplish the following tasks:

  1. Plot the points with their fill set according to the variable of interest (value).
  2. Draw a boundary around each cluster of points (based on location) within each predefined group.
  3. (Optionally and ideally) the boundary can be concave as well as convex, rather than a plain polygon, so that it is sensitive to a single point that is still "connected" to the cluster.
  4. (Optionally and ideally), the boundary drawn can be colored according to the predefined grouping variable.

Below are a toy dataset and my failed trials.

library(tidyverse)  # provides %>%, as_tibble(), gather(), mutate() and ggplot2

set.seed(123)
data <- matrix(data = c(1, 1, 1, 1, 1, 1, 1, 3, 3, 3, 3, 3,
                1, 1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3,
                1, 1, 1, 2, 2, 2, 2, 2, 3, 3, 3, 3,
                1, 1, 1, 2, 2, 2, 2, 2, 3, 3, 3, 3, 
                1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 
                1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 1, 1, 
                1, 1, 1, 2, 2, 2, 2, 2, 1, 1, 1, 1, 
                1, 1, 1, 1, 2, 2, 2, 2, 1, 1, 1, 1, 
                1, 1, 1, 2, 2, 2, 2, 2, 2, 1, 1, 3,
                1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 1, 3, 
                1, 2, 2, 4, 4, 4, 4, 4, 4, 4, 3, 3, 
                4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4), ncol = 12) %>% 
  as_tibble(rownames = "x") %>% 
  gather(y, group, -x) %>% 
  mutate(x = as.numeric(x),
         y = as.numeric(gsub("V", "", y)),
         group = factor(group)) %>% 
  mutate(value = rnorm(144))

The groupings and clusters of the points look like this:

data %>% 
  ggplot(aes(x = x, y = y, color = group)) +
  geom_point() +
  theme(axis.text = element_blank(),
        legend.title = element_text(size = 5),
        legend.text = element_text(size = 3),
        legend.key.size = unit(0.3, "lines"),
        legend.spacing.y = unit(0.1, "lines"),
        legend.margin = margin(0.2,0,0,0, unit="lines")) +
  guides(color = guide_legend(override.aes = list(size = 0.5)),
         fill = guide_legend(override.aes = list(size = 0.5)))

However, as mentioned, I need to reserve the color/fill for the value variable, so I tried the following (after reading the reference here):

hull_data <- data %>% 
  group_by(group) %>% 
  slice(chull(x, y))

data %>% 
  ggplot() +
  geom_point(aes(x = x, y = y, color = value)) +
  geom_polygon(data = hull_data, aes(x = x, y = y, fill = group), alpha = 0.5) +
  theme(axis.text = element_blank(),
        legend.title = element_text(size = 5),
        legend.text = element_text(size = 3),
        legend.key.size = unit(0.3, "lines"),
        legend.spacing.y = unit(0.1, "lines"),
        legend.margin = margin(0.2,0,0,0, unit="lines")) +
  guides(color = guide_legend(override.aes = list(size = 0.5)),
         fill = guide_legend(override.aes = list(size = 0.5)))

However, there are a couple of issues. First, this code draws a single polygon for each predefined group, even when the group consists of several spatially separate clusters. Second, the convex hull is not sensitive to single points that give a cluster a concave shape.

I can work around the first problem by hard-coding the split, but I would like to know whether there is a better method that scales to large datasets.

hull_data2 <- data %>% 
  # Hard-coded split: relabel the spatially separate parts of groups 1 and 3
  mutate(group2 = ifelse(group == 1 & x > 7.5, "a",
                  ifelse(group == 3 & y > 7.5, "b", as.character(group)))) %>% 
  unite(final_group, group, group2) %>% 
  group_by(final_group) %>% 
  slice(chull(x, y))

data %>% 
  ggplot() +
  geom_point(aes(x = x, y = y, color = value)) +
  geom_polygon(data = hull_data2, aes(x = x, y = y, fill = final_group), alpha = 0, linetype = "dashed", color = "black") +
  theme(axis.text = element_blank(),
        legend.title = element_text(size = 5),
        legend.text = element_text(size = 3),
        legend.key.size = unit(0.3, "lines"),
        legend.spacing.y = unit(0.1, "lines"),
        legend.margin = margin(0.2,0,0,0, unit="lines")) +
  guides(color = guide_legend(override.aes = list(size = 0.5)),
         fill = guide_legend(override.aes = list(size = 0.5)))
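
The hard-coded split could, in principle, be derived from the data by labelling spatially connected pieces within each group. Below is a minimal sketch in base R, assuming integer grid coordinates, unique (x, y) pairs, and 4-neighbour connectivity; `label_components` and the small `toy` grid are hypothetical names made up for illustration, not part of any package or of the dataset above.

```r
# Label spatially connected pieces within each group on an integer grid.
# `label_components` is a hypothetical helper written for this sketch.
label_components <- function(x, y, group) {
  key  <- paste(x, y)                    # lookup key for each grid cell
  comp <- rep(NA_integer_, length(x))    # component id per point
  next_id <- 0L
  for (start in seq_along(x)) {
    if (!is.na(comp[start])) next        # already part of a component
    next_id <- next_id + 1L
    comp[start] <- next_id
    queue <- start
    while (length(queue) > 0) {          # breadth-first flood fill
      i <- queue[1]
      queue <- queue[-1]
      # 4-neighbourhood: left, right, down, up
      nb <- match(paste(x[i] + c(-1, 1, 0, 0), y[i] + c(0, 0, -1, 1)), key)
      nb <- nb[!is.na(nb)]
      # keep unlabelled neighbours that belong to the same group
      nb <- nb[group[nb] == group[i] & is.na(comp[nb])]
      comp[nb] <- next_id
      queue <- c(queue, nb)
    }
  }
  comp
}

# Toy 4 x 3 grid (x varies fastest); group 1 forms two separate clusters
toy <- expand.grid(x = 1:4, y = 1:3)
toy$group <- factor(c(1, 1, 2, 1,
                      1, 2, 2, 1,
                      2, 2, 1, 1))
toy$comp <- label_components(toy$x, toy$y, toy$group)

# One convex hull per connected component instead of one per group
hulls <- do.call(rbind, lapply(split(toy, toy$comp),
                               function(d) d[chull(d$x, d$y), ]))
```

With the dataset above, the same idea would presumably read `mutate(comp = label_components(x, y, group))` followed by `group_by(comp)` and `slice(chull(x, y))`. The queue handling here is simple rather than fast, so for very large data a dedicated graph or raster labelling routine may be a better fit. The hulls are still convex, so this only addresses the first issue.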

The hard-coded version does separate clusters of the same group that are spatially distinct. However, I then noticed another issue: I cannot use the color aesthetic twice, once for value and once for the grouping variable.
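
One possible way around the "color twice" limitation is to give the points a fillable shape (shape 21) so that value goes on fill, which leaves color free for the boundary. The sketch below also swaps the convex hull for a different technique: it draws the sides of each unit grid cell whose neighbour is missing or belongs to another group, so the outline follows concave shapes and separate clusters automatically. It assumes integer grid coordinates with unique (x, y) pairs; `cell_edges` and the `toy` grid are hypothetical names used only for illustration.

```r
library(ggplot2)

set.seed(1)
# Toy 4 x 3 grid (x varies fastest) with a group label and a value per point
toy <- expand.grid(x = 1:4, y = 1:3)
toy$group <- factor(c(1, 1, 2, 1,
                      1, 2, 2, 1,
                      2, 2, 1, 1))
toy$value <- rnorm(nrow(toy))

# `cell_edges` is a hypothetical helper: it returns one segment for every
# side of a unit cell that faces a missing cell or a cell of another group.
cell_edges <- function(d) {
  key  <- paste(d$x, d$y)
  dirs <- data.frame(dx = c(-1, 1, 0, 0), dy = c(0, 0, -1, 1))
  out <- lapply(seq_len(nrow(dirs)), function(k) {
    dx <- dirs$dx[k]
    dy <- dirs$dy[k]
    nb <- match(paste(d$x + dx, d$y + dy), key)
    # keep the side when the neighbour is off the grid or in another group
    keep <- is.na(nb) | d$group[nb] != d$group
    e <- d[keep, c("x", "y", "group")]
    data.frame(x     = e$x + dx / 2 - dy / 2,
               xend  = e$x + dx / 2 + dy / 2,
               y     = e$y + dy / 2 - dx / 2,
               yend  = e$y + dy / 2 + dx / 2,
               group = e$group)
  })
  do.call(rbind, out)
}

edges <- cell_edges(toy)

# Boundary uses color (group); points use fill (value) via shape 21
p <- ggplot() +
  geom_segment(data = edges,
               aes(x = x, y = y, xend = xend, yend = yend, color = group)) +
  geom_point(data = toy, aes(x = x, y = y, fill = value),
             shape = 21, size = 4, color = "grey30") +
  scale_fill_viridis_c() +
  coord_equal()
p
```

Where two different groups touch, both draw the shared side, so one color overplots the other; shifting each segment slightly towards its own cell would be one way to keep both visible. If a second color scale is genuinely needed instead, the ggnewscale package is another option worth checking.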

I prefer plotting the graph in R, but I am open to any other languages that can complete the task. Thanks!

William Wong