
I have an ordered vector, such as:

c(2, 2.8, 2.9, 3.3, 3.5, 4.7, 5.5, 7.2, 7.3, 8.7, 8.7, 10)

I want not only to remove duplicates (which is easy with unique()), but also to average values that are too close to each other, based on a closeness threshold.

So for the above example, if the difference between two values is, say, <= 0.4, average them. The vector should become:

c(2, 2.85, 3.4, 4.7, 5.5, 7.25, 8.7, 10)

The check should be performed on pairs of numbers, until there is no more averaging to do.

EDIT: note that 2.9 and 3.3 should not be averaged, because 2.9 is already being averaged with 2.8, and once this has been done, its distance from 3.3 is greater than 0.4. So the cluster 2.8, 2.9, 3.3, 3.5 ends up being 2.85, 3.4 and not 3.125.

Is there any simple way of doing this?

AF7
  • The `cumsum(...diff(...` idiom may be used to create a grouping variable. This might be a canonical Q&A: [How to partition a vector into groups of regular, consecutive sequences?](http://stackoverflow.com/questions/5222061/how-to-partition-a-vector-into-groups-of-regular-consecutive-sequences). Just set your desired `diff`erence between consecutive numbers. – Henrik May 10 '17 at 08:14
  • @Henrik You mean something like `split(v,cumsum(c(1,diff(v)>=0.4)))`, or something like using `plyr::round_any()`. EDIT: I see now mt1022's answer. – AF7 May 10 '17 at 08:28
  • You don't need to `split`. As long as you have a grouping variable, there is a plethora of methods (in `base`, `data.table`, `dplyr`) to summarize grouped data. – Henrik May 10 '17 at 08:32
  • Sorry, I think I misunderstood your question. Reading more carefully (including your edit), it seems like this needs to be solved recursively. Cheers. – Henrik May 10 '17 at 09:22
  • @Henrik liborm's approach seems to work for me – AF7 May 10 '17 at 09:22
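
The pairwise rule described in the question (average adjacent values that are within the threshold, then repeat until nothing changes) can also be sketched directly in base R. This is only an illustration under assumptions: `merge_close` and its `threshold` argument are hypothetical names that do not appear in the thread, and the input is assumed to be sorted in increasing order.

```r
# Hypothetical helper (not from the original thread): repeatedly average
# adjacent pairs whose difference is <= threshold until no pair qualifies.
# Assumes v is sorted in increasing order.
merge_close <- function(v, threshold = 0.4) {
  repeat {
    out <- numeric(0)
    merged <- FALSE
    i <- 1
    while (i <= length(v)) {
      if (i < length(v) && v[i + 1] - v[i] <= threshold) {
        # Close pair: replace both values with their mean and skip past them
        out <- c(out, mean(v[i:(i + 1)]))
        i <- i + 2
        merged <- TRUE
      } else {
        # No close neighbour to the right: keep the value unchanged
        out <- c(out, v[i])
        i <- i + 1
      }
    }
    v <- out
    if (!merged) break  # a full pass without merging means we are done
  }
  v
}

v <- c(2, 2.8, 2.9, 3.3, 3.5, 4.7, 5.5, 7.2, 7.3, 8.7, 8.7, 10)
res <- merge_close(v)
```

For the example vector, the first pass merges 2.8 with 2.9, 3.3 with 3.5, 7.2 with 7.3 and the two 8.7s; the second pass finds no remaining pair within 0.4 (2.85 and 3.4 are 0.55 apart), so the loop stops. Note that when more than two values end up merged over several passes, the result is an average of averages, which may differ from the plain mean of the original values.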

1 Answer


What you want is essentially to cluster the input vector (with a distance threshold) and then calculate a summary statistic for each cluster. Like this:

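library(tidyverse)

# Cluster the values hierarchically, cut the tree at height 0.4,
# then average the values within each cluster.
data.frame(
  nums = c(2, 2.8, 2.9, 3.3, 3.5, 4.7, 5.5, 7.2, 7.3, 8.7, 8.7, 10)) %>%
  mutate(group = nums %>% dist %>% hclust %>% cutree(h = .4)) %>%  # cluster id per value
  group_by(group) %>%
  summarise(result = mean(nums)) %>%  # one mean per cluster
  pull(result)                        # return the means as a plain vector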

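To see what each step does, take the pipeline apart by removing the stages separated by the magrittr %>% operator one at a time, starting from the end. Take care with larger vectors, because dist computes every pairwise distance and is therefore O(N^2).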
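
For reference, the same steps can be written out one at a time in base R, which makes the intermediate objects easy to inspect. This is a sketch rather than part of the original answer; the variable names `d`, `tree`, `group` and `result` are chosen here for illustration. It relies on hclust using complete linkage by default, so two clusters are only joined below the cut height if all of their members are within that distance of each other.

```r
nums <- c(2, 2.8, 2.9, 3.3, 3.5, 4.7, 5.5, 7.2, 7.3, 8.7, 8.7, 10)

d     <- dist(nums)            # all pairwise distances (this is the O(N^2) part)
tree  <- hclust(d)             # hierarchical clustering, complete linkage by default
group <- cutree(tree, h = 0.4) # cut the tree at height 0.4: one cluster id per value

# Mean of each cluster; as.vector() drops the names and dim that tapply() adds
result <- as.vector(tapply(nums, group, mean))
```

Printing `group` shows which values were assigned to the same cluster, which is a quick way to check whether the chosen cut height behaves as intended on your own data.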

liborm