
I have a data.frame of (sub)string positions within a larger string. The data contains the start of each (sub)string and its length, so the end position of each (sub)string can be easily calculated.

data1 <- data.frame(start = c(1,3,4,9,10,13),
                   length = c(2,1,3,1,2,1)
                   )

data1$end <- (data1$start + data1$length - 1)

data1
#>   start length end
#> 1     1      2   2
#> 2     3      1   3
#> 3     4      3   6
#> 4     9      1   9
#> 5    10      2  11
#> 6    13      1  13

Created on 2019-12-10 by the reprex package (v0.3.0)

I would like to 'compress' this data.frame by merging contiguous (sub)strings (strings where one starts right after the previous one ends) so that my new data looks like this:

data2 <- data.frame(start = c(1,9,13),
                   length = c(6,3,1)
                   )

data2$end <- (data2$start + data2$length - 1)

data2
#>   start length end
#> 1     1      6   6
#> 2     9      3  11
#> 3    13      1  13


Is there a solution, preferably in base R, that gets me from data1 to data2?

M--
TimTeaFan

2 Answers


Using dplyr we can do the following:

library(dplyr)

data1 %>% 
  group_by(consecutive = cumsum(start != lag(end, default = 0) + 1)) %>% 
  summarise(start = min(start), length=sum(length), end=max(end)) %>% 
  ungroup %>% select(-consecutive)

#> # A tibble: 3 x 3
#>   start length   end
#>   <dbl>  <dbl> <dbl>
#> 1     1      6     6
#> 2     9      3    11
#> 3    13      1    13
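To see how the grouping key is built, the intermediate columns can be inspected before summarising. This is only a sketch: `prev_end` and `keyed` are names made up here for illustration, and the data is assumed to be sorted by `start`.

```r
library(dplyr)

data1 <- data.frame(start = c(1, 3, 4, 9, 10, 13),
                    length = c(2, 1, 3, 1, 2, 1))
data1$end <- data1$start + data1$length - 1

keyed <- data1 %>%
  mutate(prev_end = lag(end, default = 0),             # end of the previous row
         consecutive = cumsum(start != prev_end + 1))  # +1 whenever there is a gap

keyed$consecutive
# 0 0 0 1 1 2: rows 1-3, rows 4-5 and row 6 each share one group id
```

Each `TRUE` in `start != prev_end + 1` marks a row that does not continue the previous one, so the running sum stays constant within a contiguous run and increases at every gap.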
M--
  • Thanks, this is a very concise approach! I wonder how we would do this in base R. – TimTeaFan Dec 10 '19 at 22:54
  • @M-- With `base R`, it would be `transform(data1, consecutive = cumsum(start != c(0, end[-length(end)]) + 1))` and use that in `aggregate` – akrun Dec 10 '19 at 23:02
  • Or with `by` (where `df1` is the result of the `transform` call above): `do.call(rbind, by(df1, df1$consecutive, FUN = function(x) cbind(start = min(x$start), length = sum(x$length), end = max(x$end))))` – akrun Dec 10 '19 at 23:06
  • @TimTeaFan Please post another answer instead of editing this. Cheers. – M-- Dec 11 '19 at 22:19
  • @M-- sorry, didn't know that not including akrun's base R approach was on purpose (thought it would make it easier for other readers to see both approaches in one answer). – TimTeaFan Dec 11 '19 at 22:56
  • @TimTeaFan no worries. akrun already said that they are fine with posting the answer. You can link to their comment (here: https://stackoverflow.com/questions/59276442/compress-summarize-string-start-and-length-data-in-r/59276506?noredirect=1#comment104760118_59276506) and post an answer. SO has a clean formatting, so it won't be that hard to see the answers side by side. – M-- Dec 11 '19 at 23:03
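Putting akrun's two comments together, one possible base R version might look like the sketch below. It assumes `data1` is sorted by `start`. The names `df1`, `starts`, `ends` and `res` are placeholders chosen here, and deriving `length` from `end - start + 1` instead of summing it is a choice made for this sketch, not something stated in the comments.

```r
data1 <- data.frame(start = c(1, 3, 4, 9, 10, 13),
                    length = c(2, 1, 3, 1, 2, 1))
data1$end <- data1$start + data1$length - 1

# Group id: increases whenever a row does not start right after the previous end
df1 <- transform(data1,
                 consecutive = cumsum(start != c(0, end[-length(end)]) + 1))

# Smallest start and largest end per group
starts <- aggregate(start ~ consecutive, data = df1, FUN = min)
ends   <- aggregate(end ~ consecutive, data = df1, FUN = max)

# Join the two summaries on the group id and rebuild the length column
res <- merge(starts, ends, by = "consecutive")
res$length <- res$end - res$start + 1
res <- res[, c("start", "length", "end")]
res
```

Two `aggregate` calls are used because each column needs a different summary function; the group id is dropped at the end so that the result has the same columns as `data2`.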
2
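# Start a new group whenever the gap between a row's start and the previous row's end is not 1
f = cumsum(with(data1, c(0, start[-1] - head(end, -1))) != 1)
# Collapse each group into one row spanning its first start to its last end
do.call(rbind, lapply(split(data1, f), function(x){
    with(x, data.frame(start = start[1],
                       length = tail(end, 1) - start[1] + 1,
                       end = tail(end, 1)))}))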
#  start length end
#1     1      6   6
#2     9      3  11
#3    13      1  13
d.b
  • Thanks for your answer. I especially like the second part with `do.call(rbind, lapply(split...)` as a general way to deal with similar problems in base R. However, I find `f` easier to understand the way akrun and M-- defined it in the other answer: `f = with(data1, cumsum(start != c(0, end[-length(end)]) + 1))`. Maybe you could mention it as an alternative. – TimTeaFan Dec 12 '19 at 13:41
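As a quick check that the two definitions of the grouping vector are interchangeable, the sketch below computes both on the question's data. `f1` and `f2` are names chosen here; the two vectors differ only by a constant offset, which `split` does not care about.

```r
data1 <- data.frame(start = c(1, 3, 4, 9, 10, 13),
                    length = c(2, 1, 3, 1, 2, 1))
data1$end <- data1$start + data1$length - 1

# Definition from the answer: gap between each start and the previous end
f1 <- cumsum(with(data1, c(0, start[-1] - head(end, -1))) != 1)
# Definition from the comments: compare each start with previous end + 1
f2 <- with(data1, cumsum(start != c(0, end[-length(end)]) + 1))

f1  # 1 1 1 2 2 3
f2  # 0 0 0 1 1 2

# Same partition of the rows once the starting value is removed
same_groups <- all(f1 - f1[1] == f2 - f2[1])
same_groups
```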