compress / summarize string start and length data in R

Question

I have a data.frame of (sub)string positions within a larger string. The data contains the start of each (sub)string and its length. The end position of each (sub)string can easily be calculated.

data1 <- data.frame(start = c(1,3,4,9,10,13),
                   length = c(2,1,3,1,2,1)
                   )

data1$end <- (data1$start + data1$length - 1)

data1
#>   start length end
#> 1     1      2   2
#> 2     3      1   3
#> 3     4      3   6
#> 4     9      1   9
#> 5    10      2  11
#> 6    13      1  13

Created on 2019-12-10 by the reprex package (v0.3.0)

I would like to 'compress' this data.frame by merging contiguous (sub)strings (rows where one (sub)string starts directly after the previous one ends) so that my new data looks like this:

data2 <- data.frame(start = c(1,9,13),
                   length = c(6,3,1)
                   )

data2$end <- (data2$start + data2$length - 1)

data2
#>   start length end
#> 1     1      6   6
#> 2     9      3  11
#> 3    13      1  13

Is there a solution, preferably in base R, that gets me from data1 to data2?

@akrun Let's say this is the actual data: `(1,2), (3), (4,5,6), (9), (10,11), (13)`. Hope this helps. – M-- Dec 10 '19 at 22:56

score 2 · Answer 1 · answered Dec 10 '19 at 22:51

Using dplyr we can do the following:

library(dplyr)

data1 %>% 
  group_by(consecutive = cumsum(start != lag(end, default = 0) + 1)) %>% 
  summarise(start = min(start), length=sum(length), end=max(end)) %>% 
  ungroup %>% select(-consecutive)

#> # A tibble: 3 x 3
#>   start length   end
#>   <dbl>  <dbl> <dbl>
#> 1     1      6     6
#> 2     9      3    11
#> 3    13      1    13
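
To see what the `consecutive` grouping key in the pipeline above is doing, it can help to compute it step by step in base R. This is a minimal sketch that assumes `data1` as defined in the question; `prev_end`, `new_run`, and `key` are names introduced here for illustration only.

```r
# data1 as defined in the question
data1 <- data.frame(start  = c(1, 3, 4, 9, 10, 13),
                    length = c(2, 1, 3, 1, 2, 1))
data1$end <- data1$start + data1$length - 1

# End of the previous row, with 0 standing in for the first row
# (the base R counterpart of lag(end, default = 0))
prev_end <- c(0, head(data1$end, -1))

# A new run begins wherever start is not exactly one past the previous end
new_run <- data1$start != prev_end + 1

# The running count of run beginnings gives every row of a run the same label
key <- cumsum(new_run)

# Show the intermediate columns next to the original data
data.frame(data1, prev_end, new_run, key)
```

The label values themselves are arbitrary; what matters for `group_by` is only which rows share a label.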

answered Dec 10 '19 at 22:51

M--

Thanks, this is a very concise approach! I wonder how we would do this in base R. – TimTeaFan Dec 10 '19 at 22:54
@M-- With `base R`, it would be `transform(data1, consecutive = cumsum(start != c(0, end[-length(end)]) + 1))` and use that in `aggregate` – akrun Dec 10 '19 at 23:02
Or with `by`: `do.call(rbind, by(df1, df1$consecutive, FUN = function(x) cbind(start = min(x$start), length = sum(x$length), end = max(x$end))))` – akrun Dec 10 '19 at 23:06
@TimTeaFan Please post another answer instead of editing this. Cheers. – M-- Dec 11 '19 at 22:19
@M-- sorry, didn't know that not including akrun's base R approach was on purpose (thought it would make it easier for other readers to see both approaches in one answer). – TimTeaFan Dec 11 '19 at 22:56
@TimTeaFan no worries. akrun already said that they are fine with posting the answer. You can link to their comment (here: https://stackoverflow.com/questions/59276442/compress-summarize-string-start-and-length-data-in-r/59276506?noredirect=1#comment104760118_59276506) and post an answer. SO has clean formatting, so it won't be that hard to see the answers side by side. – M-- Dec 11 '19 at 23:03
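
The `transform` plus `aggregate` route mentioned in the comments above could look roughly like this. It is only a sketch: the comment does not spell out the `aggregate` call, so the two calls below are one possible reading, and `df1` and `res` are names assumed here. Because `aggregate` applies a single function per call, `start` and `length` are summarised separately and `end` is recomputed from them.

```r
# data1 as defined in the question
data1 <- data.frame(start  = c(1, 3, 4, 9, 10, 13),
                    length = c(2, 1, 3, 1, 2, 1))
data1$end <- data1$start + data1$length - 1

# Add the grouping key exactly as in the transform() comment
df1 <- transform(data1,
                 consecutive = cumsum(start != c(0, end[-length(end)]) + 1))

# Summarise per run: smallest start and total length
# (both calls return their rows ordered by the grouping key)
res <- data.frame(
  start  = aggregate(start  ~ consecutive, data = df1, FUN = min)$start,
  length = aggregate(length ~ consecutive, data = df1, FUN = sum)$length
)

# Recompute end the same way the question does
res$end <- res$start + res$length - 1
res
```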

score 2 · Accepted Answer · answered Dec 10 '19 at 23:03

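# Grouping key: a new run starts wherever the gap between a row's start and
# the previous row's end is not exactly 1 (the first row always starts a run)
f = cumsum(with(data1, c(0, start[-1] - head(end, -1))) != 1)
# Collapse each run into one row spanning its first start to its last end
do.call(rbind, lapply(split(data1, f), function(x){
    with(x, data.frame(start = start[1],
                       length = tail(end, 1) - start[1] + 1,
                       end = tail(end, 1)))}))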
#  start length end
#1     1      6   6
#2     9      3  11
#3    13      1  13

answered Dec 10 '19 at 23:03

d.b

Thanks for your answer. I especially like the second part with `do.call(rbind, lapply(split...)` as a general way to deal with similar problems in base R. However, I find `f` easier to understand the way akrun and M-- defined it: `f = with(data1, cumsum(start != c(0, end[-length(end)]) + 1))`. Maybe you could mention it as an alternative. – TimTeaFan Dec 12 '19 at 13:41
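
As a quick check of the alternative raised in this comment, the sketch below compares both definitions of the grouping key on `data1` from the question and then runs the same `split`/`lapply` step with the second one. The names `f_diff`, `f_lag`, `same_groups`, and `res` are assumptions made for this illustration.

```r
# data1 as defined in the question
data1 <- data.frame(start  = c(1, 3, 4, 9, 10, 13),
                    length = c(2, 1, 3, 1, 2, 1))
data1$end <- data1$start + data1$length - 1

# Key as defined in the answer: gap to the previous end is not 1
f_diff <- cumsum(with(data1, c(0, start[-1] - head(end, -1))) != 1)

# Key as defined in the comments: start is not previous end + 1
f_lag <- with(data1, cumsum(start != c(0, end[-length(end)]) + 1))

# The labels may differ, so compare the partitions rather than the raw values
same_groups <- identical(match(f_diff, unique(f_diff)),
                         match(f_lag, unique(f_lag)))

# The same split/lapply step, driven by the alternative key
res <- do.call(rbind, lapply(split(data1, f_lag), function(x) {
  with(x, data.frame(start  = start[1],
                     length = tail(end, 1) - start[1] + 1,
                     end    = tail(end, 1)))
}))
res
```

On this data the two keys carry different labels but put the same rows together, so either one can be passed to `split`.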
