I have the following data set:

structure(list(time = c(1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 
3L, 3L, 3L, 3L, 3L, 4L, 4L, 4L, 4L, 4L, 5L, 5L, 5L, 5L, 5L), 
x = c(40.8914337158203, 20.0796813964844, 13.9093618392944, 
17.1513957977295, 18.5109558105469, 40.7868537902832, 19.9750995635986, 
13.804780960083, 16.8376483917236, 18.4063758850098, 40.6822700500488, 
19.7659358978271, 13.7001991271973, 16.6284866333008, 18.3017921447754, 
40.5776901245117, 19.66135597229, 13.5956182479858, 16.3147411346436, 
18.1972122192383, 40.5776901245117, 19.5567722320557, 13.4910354614258, 
16.1055774688721, 17.9880485534668), y = c(0.603550314903259, 
-8.24852085113525, 9.65680503845215, -19.0118350982666, 6.43787002563477, 
0.704141974449158, -8.34911251068115, 9.75739574432373, -19.2130165100098, 
6.43787002563477, 0.704141974449158, -8.44970417022705, 9.75739574432373, 
-19.5147914886475, 6.43787002563477, 0.704141974449158, -8.65088748931885, 
9.85798835754395, -19.8165683746338, 6.33727836608887, 0.704141974449158, 
-8.85207080841064, 9.85798835754395, -20.1183433532715, 6.33727836608887
), object = c(1L, 2L, 3L, 4L, 5L, 1L, 2L, 3L, 4L, 5L, 1L, 
2L, 3L, 4L, 5L, 1L, 2L, 3L, 4L, 5L, 1L, 2L, 3L, 4L, 5L)), class = c("tbl_df", 
"tbl", "data.frame"), row.names = c(NA, -25L), .Names = c("time", 
"x", "y", "object"))

Now, I would like to calculate the convex hull (using the chull function) for each value of time and store it within the same dataset (as I would like to plot it with ggplot2 afterwards). I can run chull for a single time value with

chull(filter(data_sample, time == 1)$x, filter(data_sample, time == 1)$y)

which returns the vector 4 3 1. So I thought I could first group by time and then calculate the convex hull points within each group with something like

data_sample %>% group_by(time) %>% summarise(pts = chull(data_sample$x, data_sample$y))

The problem is that I cannot store a vector in a row. Storing each vertex in a separate column would be an option, but the following

data_sample %>% group_by(time) %>% summarise(pt1 = chull(data_sample$x, data_sample$y)[1])

doesn't give reasonable results. So my questions are: 1. How can I store a vector for each row within one column? I have read that tibbles can have list columns, but how can I create one in my case? 2. What's wrong with my attempt to calculate chull within each group?

  • (extra question, if I may) Why doesn't data_sample %>% filter(time == 1) %>% chull(.$x, .$y) work? Is it because chull is not designed to work with pipes and dplyr?
Kuba_

3 Answers

Since chull is giving you indices on the original data, you probably want to preserve the coordinates as you go, which means you probably should not be using summarize. I suggest you go with the "nested" concept as done with tidyr. The first step is nesting your data:

library(tidyr)
data_sample %>%
  group_by(time) %>%
  nest()
# # A tibble: 5 × 2
#    time             data
#   <int>           <list>
# 1     1 <tibble [5 × 3]>
# 2     2 <tibble [5 × 3]>
# 3     3 <tibble [5 × 3]>
# 4     4 <tibble [5 × 3]>
# 5     5 <tibble [5 × 3]>

From here, it's just a matter of calculating the hull (which returns a vector of indices) and then extracting the relevant rows, in the order provided. The map functions provided by purrr help here:

library(purrr)
data_sample %>%
  group_by(time) %>%
  nest() %>%
  mutate(
    hull = map(data, ~ with(.x, chull(x, y))),
    out = map2(data, hull, ~ .x[.y,,drop=FALSE])
  )
# # A tibble: 5 × 4
#    time             data      hull              out
#   <int>           <list>    <list>           <list>
# 1     1 <tibble [5 × 3]> <int [3]> <tibble [3 × 3]>
# 2     2 <tibble [5 × 3]> <int [3]> <tibble [3 × 3]>
# 3     3 <tibble [5 × 3]> <int [3]> <tibble [3 × 3]>
# 4     4 <tibble [5 × 3]> <int [3]> <tibble [3 × 3]>
# 5     5 <tibble [5 × 3]> <int [3]> <tibble [3 × 3]>

(You should be able to get away with putting both assignments into a single mutate.)

From here, you can turn it into the coordinates you need by removing now-unnecessary columns and unnesting:

data_sample %>%
  group_by(time) %>%
  nest() %>%
  mutate(
    hull = map(data, ~ with(.x, chull(x, y))),
    out = map2(data, hull, ~ .x[.y,,drop=FALSE])
  ) %>%
  select(-data) %>%
  unnest(cols = c(hull, out))  # tidyr >= 1.0.0 expects cols; older versions accepted a bare unnest()
# # A tibble: 15 × 5
#     time  hull        x           y object
#    <int> <int>    <dbl>       <dbl>  <int>
# 1      1     4 17.15140 -19.0118351      4
# 2      1     3 13.90936   9.6568050      3
# 3      1     1 40.89143   0.6035503      1
# 4      2     4 16.83765 -19.2130165      4
# 5      2     3 13.80478   9.7573957      3
# 6      2     1 40.78685   0.7041420      1
# 7      3     4 16.62849 -19.5147915      4
# 8      3     3 13.70020   9.7573957      3
# 9      3     1 40.68227   0.7041420      1
# 10     4     4 16.31474 -19.8165684      4
# 11     4     3 13.59562   9.8579884      3
# 12     4     1 40.57769   0.7041420      1
# 13     5     4 16.10558 -20.1183434      4
# 14     5     3 13.49104   9.8579884      3
# 15     5     1 40.57769   0.7041420      1

(I kept hull here for demonstration purposes; you probably can select(-data, -hull) above since you'll have what you need, especially if redundant with object.)
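If the intermediate list columns are not needed, a shorter route may be slice(), which accepts row positions that are evaluated within each group. The following is a minimal sketch only; toy is a hypothetical data frame (a unit square plus its centre at each time), not the question's data.

```r
library(dplyr)

# Hypothetical toy data: at each time, four corners of a unit square
# plus one interior point that should not end up on the hull.
toy <- tibble(
  time = rep(1:2, each = 5),
  x = rep(c(0, 1, 1, 0, 0.5), times = 2),
  y = rep(c(0, 0, 1, 1, 0.5), times = 2)
)

# chull(x, y) is evaluated once per group, and slice() keeps the rows
# at the returned positions, in hull order.
hull_rows <- toy %>%
  group_by(time) %>%
  slice(chull(x, y)) %>%
  ungroup()

hull_rows
```

The result has one row per hull vertex, so it can be handed to a plotting function in the same way as the unnested output above.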

For your last question, you could have done either one of these:

filter(data_sample, time == 1) %>%
  with(., chull(x, y))
with(filter(data_sample, time == 1), chull(x, y))
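As for why the piped call in the question fails: the pipe inserts its left-hand side as the first argument even when a dot appears inside nested calls, so chull ends up receiving the data frame plus the two vectors. Wrapping the call in braces suppresses that insertion. A small sketch, using a hypothetical toy data frame rather than the question's data:

```r
library(dplyr)

# Hypothetical toy data: four corners of a unit square plus its centre.
toy <- tibble(
  time = c(1L, 1L, 1L, 1L, 1L),
  x = c(0, 1, 1, 0, 0.5),
  y = c(0, 0, 1, 1, 0.5)
)

# The braces stop the pipe from passing the data frame as the first
# argument, so chull() receives only the two coordinate vectors.
hull_idx <- toy %>%
  filter(time == 1) %>%
  { chull(.$x, .$y) }

# Positions of the hull vertices within the filtered data.
sort(hull_idx)
```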
r2evans

You can simply wrap the chull call in a list:

df <- df %>% 
  group_by(time) %>% 
  mutate(chull_val = list(chull(x,y)))
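The same wrapping should also work with summarise, giving one row per time with the indices held in a list column. Note the bare column names: writing data_sample$x inside a grouped verb refers to the whole column and ignores the grouping, which is why the attempt in the question returned the same values for every time. A minimal sketch on a hypothetical toy data frame:

```r
library(dplyr)

# Hypothetical toy data: at each time, four corners of a unit square
# plus one interior point.
toy <- tibble(
  time = rep(1:2, each = 5),
  x = rep(c(0, 1, 1, 0, 0.5), times = 2),
  y = rep(c(0, 0, 1, 1, 0.5), times = 2)
)

# Bare x and y are evaluated within each group; list() wraps the
# variable-length index vector so it fits into a single cell.
hulls <- toy %>%
  group_by(time) %>%
  summarise(pts = list(chull(x, y)))

hulls
```

Each element of pts holds positions relative to its own group, so it needs to be applied to that group's rows rather than to the full data.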
YOLO

If you don't want to work with list columns*, you may consider using (the more flexible) data.table.

library(data.table)
setDT(d)  # d is assumed to hold the question's data (data_sample)
d[d[ , .I[chull(x, y)], by = time]$V1]

Explanation: convert your data to a data.table (setDT(d)). For each time (by = time), calculate the chull indices and use .I to turn them into row numbers of the full table, which then select the corresponding rows.


If you want to plot the chull polygons, you need to add the first index to close the polygon.

d2 <- d[ , {

  # for each time (by = time):
  # compute the indices lying on the convex hull  
  ix <- chull(x, y)

  # use indices to select data of each subset (.SD)
  # possibly also add the first coordinate to close the polygon for plotting   
  .SD[c(ix, ix[1])]}, by = time]


# plot chull and original polygons
library(ggplot2) 
ggplot(d2, aes(x, y, fill = factor(time))) +
  geom_polygon(alpha = 0.2) +
  geom_polygon(data = d, alpha = 0.2)


*Related dplyr issues: Summarising verbs with variable-length outputs, Optional parameter to control length of summarise.

Henrik