20

Let's say I have the following vector of numbers:

vec = c(1, 2, 3, 5, 7, 8, 9, 10, 11, 12)

I'm looking for a function that will create a string summarizing the list of numbers the way a human would. That is, each run of consecutive numbers (here 1, 2, 3 and 7, 8, 9, 10, 11, 12) is collapsed into its start and end value:

"1-3, 5, 7-12"

How can I do this in R?

Henrik
CephBirk
  • See also [Collapse continuous integer runs to strings of ranges](https://stackoverflow.com/questions/14868406/collapse-continuous-integer-runs-to-strings-of-ranges) – Henrik Jun 15 '20 at 22:04

3 Answers

28

As another alternative, you could use a deparsing approach. For example:

deparse(c(1L, 2L, 3L))
#[1] "1:3"

Taking advantage of the fact that as.character deparses each element of a list it is given, we could use:

as.character(split(as.integer(vec), cumsum(c(TRUE, diff(vec) != 1))))
#[1] "1:3"  "5"    "7:12"
toString(gsub(":", "-", .Last.value))
#[1] "1-3, 5, 7-12"
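To make the role of as.integer clearer, here is a small sketch. The helper name collapse_runs_deparse is hypothetical (not from the answer), and it assumes the input is already sorted and holds whole numbers. It shows that only integer vectors deparse to range notation, then wraps the steps above in a function.

```r
# Increasing integer runs deparse to range notation; doubles do not.
deparse(1:3)         # "1:3"
deparse(c(1, 2, 3))  # "c(1, 2, 3)"

# Hypothetical helper wrapping the steps above (assumes sorted whole numbers).
collapse_runs_deparse <- function(v) {
  v <- as.integer(v)
  # Start a new group whenever the gap to the previous value is not 1.
  runs <- split(v, cumsum(c(TRUE, diff(v) != 1L)))
  # as.character() on a list deparses each element, e.g. "1:3".
  toString(gsub(":", "-", as.character(runs), fixed = TRUE))
}

collapse_runs_deparse(c(1, 2, 3, 5, 7, 8, 9, 10, 11, 12))
# "1-3, 5, 7-12"
```

Storing the intermediate result in runs avoids relying on .Last.value, which is only convenient when typing at the console.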
alexis_laz
  • 11
    What sorcery is this? – David Arenburg Jan 06 '16 at 16:10
  • 1
    FWIW: the call to `as.character` is superfluous, as `gsub` starts by calling it if the input is not of type "character". – Tensibai Jan 06 '16 at 16:31
  • 1
    Using `fixed = TRUE` would definitely speed this up – Rich Scriven Jan 06 '16 at 16:47
  • 2
    @Tensibai : You're right, although I guess it'll then look like a quiz..! @RichardScriven : I've not benchmarked it, but I think `deparse` is already so slow that `fixed = TRUE` would gain much less than switching to another approach. @DavidArenburg : Debugging in R more often leads to a "why" than to a "how does this work". – alexis_laz Jan 06 '16 at 17:17
  • 1
    For those wondering why: it's [here](https://github.com/wch/r-source/blob/b156e3a711967f58131e23c1b1dc1ea90e2f0c43/src/main/deparse.c#L1325-L1329) that `deparse` special-cases increasing integer vectors and returns them in range notation. – Tensibai Jan 07 '16 at 08:40
21

I assume that the vector is sorted, as in the example. If not, use vec <- sort(vec) beforehand.

Edit note: @DavidArenburg spotted a mistake in my original answer where c(min(x), x) should actually be c(0, x). Since we know now that we always need to add a 0 in the first place, we can omit the first step of creating x and do it "on the fly". The original answer and additional options are now edited to reflect that (you can check the edit history for the original post). Thanks David!

A note on calls to unname: I used unname(sapply(...)) to ensure that the resulting vector is not named, otherwise it would be named 0:(n-1) where n equals the length of new_vec. As @Tensibai noted correctly in the comments, this doesn't matter if the final aim is to generate a length-1 character vector as produced by running toString(new_vec) since vector names will be omitted by toString anyway.
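The point about names can be checked with a short sketch (the variable name named is only for illustration). It builds the named result without unname and shows that toString gives the same string either way.

```r
vec <- c(1, 2, 3, 5, 7, 8, 9, 10, 11, 12)

# Without unname(), sapply() keeps the group labels as names.
named <- sapply(split(vec, c(0, cumsum(diff(vec) > 1))), function(y) {
  paste(unique(range(y)), collapse = "-")
})

names(named)     # "0" "1" "2"
toString(named)  # "1-3, 5, 7-12"

# The names do not reach the final string.
identical(toString(named), toString(unname(named)))  # TRUE
```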


One option (possibly not the shortest) would be:

new_vec <- unname(sapply(split(vec, c(0, cumsum(diff(vec) > 1))), function(y) {
  if(length(y) == 1) y else paste0(head(y, 1), "-", tail(y, 1))
}))

Result:

new_vec
#[1] "1-3"  "5"    "7-12"
toString(new_vec)
#[1] "1-3, 5, 7-12"

Thanks to @Zelazny7 it can be shortened by using the range function:

new_vec <- unname(sapply(split(vec, c(0, cumsum(diff(vec) > 1))), function(y) {
    paste(unique(range(y)), collapse='-')
}))

Thanks to @DavidArenburg it can be further shortened by using tapply instead of sapply + split:

new_vec <- unname(tapply(vec, c(0, cumsum(diff(vec) > 1)), function(y) {
  paste(unique(range(y)), collapse = "-")
}))
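If the input may be unsorted or contain duplicates, the tapply version can be wrapped in a small function. This is only a sketch: the name collapse_runs and the sep argument are my own additions, not part of the answer above.

```r
# Hypothetical wrapper around the tapply approach.
collapse_runs <- function(v, sep = ", ") {
  v <- sort(unique(v))               # handle unsorted input and duplicates
  if (length(v) == 0L) return("")    # nothing to summarise
  grp <- c(0, cumsum(diff(v) > 1))   # new group after every gap larger than 1
  runs <- tapply(v, grp, function(y) paste(unique(range(y)), collapse = "-"))
  paste(runs, collapse = sep)
}

collapse_runs(c(12, 3, 1, 2, 5, 7, 8, 9, 10, 11, 3))
# "1-3, 5, 7-12"
collapse_runs(c(1, 2, 3, 5), sep = "; ")
# "1-3; 5"
```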
talat
  • 3
    Could use `paste(unique(range(y)), collapse='-')` instead of `head` and `tail` – Zelazny7 Jan 06 '16 at 15:28
  • @Zelazny7, that's a nice idea, thanks. I'll add it as another option – talat Jan 06 '16 at 15:29
  • 1
    For the `unname` calls, as long as you wrap it in `toString` after, they are unnecessary as `toString` or `paste0(..,collapse=", ")` will not take the names anyway. – Tensibai Jan 06 '16 at 16:08
7

EDITS: I sped up docendo's code quite a bit by sorting the vector first, so now they are actually on equal footing.

I also added alexis' approach.

readable_integers <- function(integers)
{
  integers <- sort(unique(integers))
  group <- cumsum(c(0, diff(integers)) != 1)

  paste0(vapply(split(integers, group),
           function(x){
             if (length(x) == 1) as.character(x)
             else paste0(range(x), collapse = "-")
           },
           character(1)),
           collapse = "; ")
}

library(microbenchmark)
vec = c(1, 2, 3, 5, 7, 8, 9, 10, 11, 12)
microbenchmark(
  docendo = {vec <- sort(vec)
    x <- cumsum(diff(vec) > 1)
   toString(tapply(vec, c(0, x), function(y) paste(unique(range(y)), collapse = "-")))
  },
  Benjamin = readable_integers(vec),
  alexis = {vec <- sort(vec)
            runs <- as.character(split(as.integer(vec), cumsum(c(TRUE, diff(vec) != 1))))
            toString(gsub(":", "-", runs))}
)

Unit: microseconds
     expr     min       lq     mean  median       uq     max neval
  docendo 205.273 220.3755 230.3134 228.293 235.4780 467.142   100
 Benjamin 121.991 128.4420 135.5302 133.574 143.3980 161.286   100
   alexis 121.698 128.0030 137.0374 136.507 143.3975 169.790   100

set.seed(pi)
vec = sample(1:1000, 900)

microbenchmark(
  docendo = {vec <- sort(vec)
   x <- cumsum(diff(vec) > 1)
   toString(tapply(sort(vec), c(0, x), function(y) paste(unique(range(y)), collapse = "-")))
  },
  Benjamin = readable_integers(vec),
  alexis = {vec <- sort(vec)
            runs <- as.character(split(as.integer(vec), cumsum(c(TRUE, diff(vec) != 1))))
            toString(gsub(":", "-", runs))}
)
Unit: microseconds
     expr      min        lq      mean    median        uq      max neval
  docendo 1307.294 1353.7735 1420.3088 1379.7265 1427.8190 2554.473   100
 Benjamin  615.525  626.8155  661.2513  638.8385  665.3765 1676.493   100
   alexis  799.684  808.3355  866.1516  820.0650  833.2615 1974.138   100
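Before comparing timings, it may be worth confirming that the approaches return the same string. The following is a minimal sketch using the small example vector; the variable names are illustrative. It stores the intermediate result in a variable because .Last.value is only updated after a top-level call completes, so it is not reliable inside a braced expression.

```r
vec <- c(1, 2, 3, 5, 7, 8, 9, 10, 11, 12)

# tapply-based approach
res_tapply <- toString(tapply(vec, c(0, cumsum(diff(vec) > 1)),
                              function(y) paste(unique(range(y)), collapse = "-")))

# deparse-based approach, with the intermediate stored explicitly
runs <- as.character(split(as.integer(vec), cumsum(c(TRUE, diff(vec) != 1))))
res_deparse <- toString(gsub(":", "-", runs, fixed = TRUE))

# Stops with an error if the two results differ.
stopifnot(identical(res_tapply, res_deparse))
res_tapply
# "1-3, 5, 7-12"
```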
Benjamin
  • 1
    I think replacing the outer `paste0` with `toString` makes it cleaner (for the same result). You're also not calling `unname`, which in fact serves no purpose when the result is wrapped in a `paste0` or `toString` call, so maybe that's where the gain comes from. – Tensibai Jan 06 '16 at 16:01
  • No real performance change, but using `toString` might take away the flexibility of choosing your collapse character (for instance, if you wanted "1-3; 5; 7-12"). So it seems a matter of preference and utility. – Benjamin Jan 06 '16 at 16:09
  • Indeed, it just saves the collapse=", " in this particular case. I thought it was worth mentioning :) – Tensibai Jan 06 '16 at 16:12
  • I agree. `toString` is a function I seem to have forgotten. It may become more frequent in my future coding. – Benjamin Jan 06 '16 at 16:13
  • Thanks. I forgot to wrap the first set of results in `toString` (or `paste` or whatever). These all return the desired strings now (unless there's another problem I'm not understanding) – Benjamin Jan 06 '16 at 17:08