Sequence length encoding using R

Question

Is there a way to encode increasing integer sequences in R, analogous to encoding run lengths using run length encoding (rle)?

I'll illustrate with an example:

Analogy: Run length encoding

r <- c(rep(1, 4), 2, 3, 4, rep(5, 5))
rle(r)
Run Length Encoding
  lengths: int [1:5] 4 1 1 1 5
  values : num [1:5] 1 2 3 4 5

Desired: sequence length encoding

s <- c(1:4, rep(5, 4), 6:9)
s
[1] 1 2 3 4 5 5 5 5 6 7 8 9

somefunction(s)
Sequence lengths
  lengths: int [1:4] 5 1 1 5
  value1 : num [1:4] 1 5 5 5

Edit 1

Thus, somefunction(1:10) will give the result:

Sequence lengths
  lengths: int [1:1] 10
  value1 : num [1:1] 1

This result means that there is an integer sequence of length 10 with a starting value of 1, i.e. seq(1, 10).

Note that there isn't a mistake in my example result. The vector in fact ends in the sequence 5:9, not 6:9 which was used to construct it.

My use case is that I am working with survey data in an SPSS export file. Each subquestion in a grid of questions will have a name of the pattern paste("q", 1:5), but sometimes there is an "other" category which will be marked q_99, q_other or something else. I wish to find a way of identifying the sequences.

Edit 2

In a way, my desired function is the inverse of the base function sequence, with the start value, value1 in my example, added.

lengths <- c(5, 1, 1, 5)
value1 <- c(1, 5, 5, 5)

s
[1] 1 2 3 4 5 5 5 5 6 7 8 9
sequence(lengths) + rep(value1-1, lengths) 
[1] 1 2 3 4 5 5 5 5 6 7 8 9
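
The reconstruction above can be wrapped in a small decoder to check a round trip. This is only a sketch: the name decode_sle is made up for illustration, and it assumes numeric starting values (it would not work with an entry such as "other").

```r
# Hypothetical helper (not from the thread): rebuild a vector from its
# sequence lengths and starting values.
decode_sle <- function(lengths, value1) {
  # sequence(lengths) counts 1..k inside each run; adding (start - 1)
  # shifts every run so that it begins at its own starting value.
  sequence(lengths) + rep(value1 - 1, lengths)
}

s <- c(1:4, rep(5, 4), 6:9)
decoded <- decode_sle(c(5, 1, 1, 5), c(1, 5, 5, 5))
all(decoded == s)
```

If an encoder returns exactly these lengths and starting values for s, decoding them gives back s, which is a convenient way to test any candidate somefunction.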

Edit 3

I should have stated that for my purposes a sequence is a run of consecutive increasing integers (each element equals the previous one plus 1), as opposed to any monotonically increasing sequence: e.g. c(4,5,6,7) qualifies, but neither c(2,4,6,8) nor c(5,4,3,2,1) does. However, any other integer can appear between sequences.

This means a solution should be able to cope with this test case:

somefunction(c(2, 4, 1:4, 5, 5))
Sequence lengths
  lengths: int [1:4] 1 1 5 1
  value1 : num [1:4] 2 4 1 5

In the ideal case, the solution can also cope with the use case suggested originally, which would include characters in the vector, e.g.

somefunction(c(2, 4, 1:4, 5, 5, "other"))
Sequence lengths
  lengths: int [1:5] 1 1 5 1 1
  value1 : num [1:5] 2 4 1 5 "other"

Andrie, I am still not clear how your sequence encoding works. Where do the values come from, and what do the lengths imply? +1 for laying it out with an example, but you can make it more clear. — Ramnath, Aug 16 '11 at 11:53
Please define: "sequence". :) I'm with Ramnath - it's not quite making sense. — Iterator, Aug 16 '11 at 12:08
@Ramnath I hope the edit makes it more clear. In the sequence 1:10 the length is 10, and the value1 is 1. In other words you can pass these parameters to seq.int to reconstruct the original vector. For example `seq.int(1, length.out=10)` – Andrie, Aug 16 '11 at 12:16
@Iterator. A sequence is defined the same as the `seq` function in R. So, 1:5 is the integer sequence from 1 to 5, i.e. `c(1,2,3,4,5)` — Andrie, Aug 16 '11 at 12:17
Got it. So, is this simply a way of deciding when to switch between `rep` and `seq` in order to reproduce a given vector? (Where it seems that `rep` has a default replication of 1?) If so, then that is an interesting encoding question. – Iterator, Aug 16 '11 at 12:19
@Iterator Just `seq`, not `rep`. Any repeated values will simply be repeated as elements in the results vector, i.e. `seq.int(..., length.out=1)` — Andrie, Aug 16 '11 at 12:22
A suggestion for people answering: a solution that doesn't use `diff`, and uses logical comparisons instead is generalizable (to non-numeric vectors) and could be much faster. — Iterator, Aug 16 '11 at 13:09
@Iterator: how would you define a sequence in a non-numerical case? — Nick Sabbe, Aug 16 '11 at 13:48
@Nick: I have a beautiful definition, but unfortunately the margins of this website are not large enough for me to express it. :) Touché, I believe you are correct. — Iterator, Aug 16 '11 at 13:57
Can you show us the output of colnames(your_df)? So we can see how the question numbers are labelled? I deal with this frequently with market research data files as well. — Brandon Bertelsen, Aug 16 '11 at 14:29

Joris Meys · Accepted Answer · 2011-08-16T13:45:14.607

9

EDIT : added control to do the character vectors as well.

Based on rle, I came to the following solution:

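somefunction <- function(x){

    # non-numeric input (e.g. strings) is coerced; unparseable entries become NA
    if(!is.numeric(x)) x <- as.numeric(x)
    n <- length(x)
    # TRUE wherever an element is not its predecessor + 1, i.e. a run ends there
    y <- x[-1L] != x[-n] + 1L
    # end index of every run; comparisons involving NA also count as breaks
    i <- c(which(y|is.na(y)),n)

    list(
      lengths = diff(c(0L,i)),
      values = x[head(c(0L,i)+1L,-1L)]
    )

}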

> s <- c(2,4,1:4, rep(5, 4), 6:9,4,4,4)

> somefunction(s)
$lengths
[1] 1 1 5 1 1 5 1 1 1

$values
[1] 2 4 1 5 5 5 4 4 4

This one works on every test case I tried and uses only vectorized operations, without ifelse clauses, so it should run faster. It converts strings to NA, so you keep a numeric output.

> S <- c(4,2,1:5,5, "other" , "other",4:6,2)

> somefunction(S)
$lengths
[1] 1 1 5 1 1 1 3 1

$values
[1]  4  2  1  5 NA NA  4  2

Warning message:
In somefunction(S) : NAs introduced by coercion
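
To see where the lengths and values come from, the intermediate vectors can be traced for the s from the question. This is only an illustrative walk-through of the same steps; the names run_lengths and run_starts are introduced here for readability and are not part of the function.

```r
s <- c(1:4, rep(5, 4), 6:9)
n <- length(s)

# Compare each element with its predecessor + 1; TRUE marks a break.
y <- s[-1L] != s[-n] + 1L

# Index of the last element of every run (the final element always ends one).
i <- c(which(y | is.na(y)), n)

# Differences between consecutive run ends are the run lengths.
run_lengths <- diff(c(0L, i))

# Each run starts one position after the previous run ended.
run_starts <- head(c(0L, i) + 1L, -1L)

run_lengths      # 5 1 1 5
s[run_starts]    # 1 5 5 5
```

The three breaks sit between the repeated 5s, so the run ends are positions 5, 6, 7 and 12, which gives the lengths 5, 1, 1, 5 asked for in the question.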

edited Aug 16 '11 at 13:45

answered Aug 16 '11 at 12:46

Joris Meys


why are 2 and 4 being counted as part of separate subsequences? – Ramnath Aug 16 '11 at 13:13
Because @Andrie said that a sequence is defined like in R by using `:`. and 4 does not follow on 2. – Joris Meys Aug 16 '11 at 13:18
+1 Vectorized - nice! The way things should be. It's even called `somefunction`. Gotta go downvote Andrie for suggesting a bad name. Just kidding. – Iterator Aug 16 '11 at 14:06

Ramnath · Answer 2 · 2011-08-16T14:59:45.367

5

Here is my solution

diff_s = which(diff(s) != 1)
lengths = diff(c(0, diff_s, length(s)))
values  = s[c(1, diff_s + 1)]

EDIT: function to take care of strings too

sle2 = function(s){
  s2 = as.numeric(s)
  s2[is.na(s2)] = 100 + as.numeric(factor(s[is.na(s2)]))
  diff_s2 = which(diff(s2) != 1)
  lengths = diff(c(0, diff_s2, length(s)))
  values  = s[c(1, diff_s2 + 1)]
  return(list(lengths = lengths, values = values))
}

sle2(c(4,2,1:5,5, "other" , "other",4:6,2, "someother", "someother"))

$lengths
 [1] 1 1 5 1 1 1 3 1 1 1

$values
 [1] "4"         "2"         "1"         "5"         "other"     "other"     "4"         "2"         "someother" "someother"

Warning message:
In sle2(c(4, 2, 1:5, 5, "other", "other", 4:6, 2, "someother", "someother")) :
  NAs introduced by coercion
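
One caveat with placeholder codes is that two different strings next to each other receive consecutive codes (for example 101 and 102) and are then merged into one "sequence", and a real value above 100 could collide with a code. A possible variant, sketched here under the assumption that every non-numeric entry should always form a run of its own, treats NA as a break and still returns the original strings. The name sle_chr is made up for this illustration.

```r
# Hypothetical variant (not from the thread): NA-based break detection,
# but the values are taken from the original, uncoerced input.
sle_chr <- function(s) {
  x <- suppressWarnings(as.numeric(s))   # strings become NA
  n <- length(x)                         # assumes a non-empty vector
  brk <- x[-1L] != x[-n] + 1             # NA wherever a string is involved
  i <- c(which(brk | is.na(brk)), n)     # run ends; NA counts as a break
  starts <- head(c(0L, i) + 1L, -1L)
  list(lengths = diff(c(0L, i)), values = s[starts])
}

res <- sle_chr(c(1, 2, "a", "b"))
res$lengths   # 2 1 1
res$values    # "1" "a" "b"
```

With the placeholder approach the same input would report "a" and "b" as one run of length 2; here they stay separate.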

edited Aug 16 '11 at 14:59

answered Aug 16 '11 at 12:43

Ramnath


try s <- c(2,4,1:4, rep(5, 4), 6:9,4,4,4) on it. Doesn't work. – Joris Meys Aug 16 '11 at 12:46
nope. Output should be 1 1 5 1 1 5 1 1 1 for the lengths, but it doesn't give that. ( And please, use `<-` instead of `=` for assignments. I know both work, but still.. ) – Joris Meys Aug 16 '11 at 12:54
nope. here is the sequence (2, 4), (1, 2, 3, 4, 5), (5), (5), (5, 6, 7, 8, 9), (4), (4), (4) with lengths being `2, 5, 1, 1, 5, 1, 1, 1` and starting values `2, 1, 5, 5, 5, 4, 4, 4` – Ramnath Aug 16 '11 at 12:56
I think there is a problem that you count on having a lower number after the end of a sequence. As such, your lengths are not correct with (e.g.): `s <- c(2,4,1:4, rep(5, 4), 6:9,12,4,11)`. – Nick Sabbe Aug 16 '11 at 13:20
@Nick, Joris. this should fix it as I have a better understanding of subsequence now. – Ramnath Aug 16 '11 at 13:55

Nick Sabbe · Answer 3 · 2011-08-16T12:42:40.983

4

You could use this for a start (given your s above):

s2<-c(0, diff(s))
s3<-ifelse((c(s2[-1], 0)==1) & (s2!=1), 1, s2)
rle(ifelse(s3==1, -1, seq_along(s3)))

It doesn't return the values yet, but there are probably easy enough ways to adapt the code. At least you have the sequence lengths, so you can easily retrieve the starting values for the sequences.
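
One way to retrieve the starting values from the lengths, sketched for the s from the question: the first run starts at position 1 and every later run starts right after the cumulative length of the runs before it. The names run_lengths, starts and value1 are chosen here for illustration.

```r
s <- c(1:4, rep(5, 4), 6:9)

# The three lines from above, keeping only the lengths of the rle result.
s2 <- c(0, diff(s))
s3 <- ifelse((c(s2[-1], 0) == 1) & (s2 != 1), 1, s2)
run_lengths <- rle(ifelse(s3 == 1, -1, seq_along(s3)))$lengths

# Start position of each run: 1, then 1 + cumulative length of earlier runs.
starts <- cumsum(c(1, head(run_lengths, -1)))
value1 <- s[starts]

run_lengths   # 5 1 1 5
value1        # 1 5 5 5
```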

edited Aug 16 '11 at 12:42

answered Aug 16 '11 at 12:09

Nick Sabbe


I think that the values may be referenced by indices of non-zero elements of `s2` or from the same for a second-pass diff on `s2`. I'm still getting my head around the original problem of the lengths; the values seem easier to me... – Iterator Aug 16 '11 at 12:16
That's going to cause trouble with c(2,4,1:4,5,5,...) – Joris Meys Aug 16 '11 at 12:36

score 3 · Answer 4 · answered Aug 16 '11 at 13:08

3

How about:

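sle <- function(s)
{
    # positions after which the next value is not the previous value + 1
    diffs <- which(diff(s)!=1)
    # run ends are diffs plus the last index; this form also handles a
    # vector without any break, e.g. sle(1:10)
    lengths <- diff(c(0L,diffs,length(s)))
    value1 <- s[c(1,diffs+1)]
    cat("", "Sequence Length Encoding\n", " lengths:")
    str(lengths)
    cat("  value1:")
    str(value1)
}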


sle(s)
 Sequence Length Encoding
  lengths: int [1:4] 5 1 1 5
  value1: num [1:4] 1 5 5 5

sle(c(2,4,1:4,rep(5,4),6:9,4,4,4))
 Sequence Length Encoding
  lengths: int [1:9] 1 1 5 1 1 5 1 1 1
  value1: num [1:9] 2 4 1 5 5 5 4 4 4

answered Aug 16 '11 at 13:08

James


why are 2 and 4 being counted as part of separate sub sequences? the key is how to determine subsequences. my assumption is that the breaking point is when the numbers don't increase. what is yours? – Ramnath Aug 16 '11 at 13:12
@Ramnath : as they both form a sequence of length 1? – Joris Meys Aug 16 '11 at 13:17
@Ramnath I understood that the sequences of interest were consecutive integers. – James Aug 16 '11 at 13:18
This was my original intention, so your solution does what I expected. @Ramnath, I have edited my question to specify this explicitly. Now, for an extra challenge, how to cope with characters in the vector? – Andrie Aug 16 '11 at 13:33
@Andrie Don't think it would work with a mixture of numbers and characters, but you could always change the characters to a negative integer to highlight them. – James Aug 16 '11 at 13:38
+1 I think this works. Rather than changing characters to negative integers, perhaps change them to NA, and store them as an additional list element. – Andrie Aug 16 '11 at 13:49

score 3 · Answer 5 · edited Aug 17 '11 at 13:21

3

Here's an enhancement to Joris Meys's solution. Consider this a solution to a future problem :-) .

Carl

seqle <- function(x,incr=1) {
    if(!is.numeric(x)) x <- as.numeric(x)
    n <- length(x)
    #y <- x[-1L] != x[-n] + 1L
    y <- x[-1L] != x[-n] + incr
    i <- c(which(y|is.na(y)),n)
    list( lengths = diff(c(0L,i)),  values = x[head(c(0L,i)+1L,-1L)])
}
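
A short usage sketch of the incr argument, with the function repeated so the snippet runs on its own. The input vector is made up for illustration: with incr = 2 the run 2, 4, 6, 8 counts as one sequence, while 1, 2, 3 (step 1) is split into single elements.

```r
seqle <- function(x, incr = 1) {
    if (!is.numeric(x)) x <- as.numeric(x)
    n <- length(x)
    # a run continues only while each element equals its predecessor + incr
    y <- x[-1L] != x[-n] + incr
    i <- c(which(y | is.na(y)), n)
    list(lengths = diff(c(0L, i)), values = x[head(c(0L, i) + 1L, -1L)])
}

res2 <- seqle(c(2, 4, 6, 8, 1, 2, 3), incr = 2)
res2$lengths   # 4 1 1 1
res2$values    # 2 1 2 3

res1 <- seqle(c(2, 4, 6, 8, 1, 2, 3))   # default incr = 1
res1$lengths   # 1 1 1 1 3
res1$values    # 2 4 6 8 1
```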

edited Aug 17 '11 at 13:21

Ben Bolker


answered Aug 17 '11 at 13:11

Carl Witthoft


Thanks, Ben, for improving the formatting. I should have done that myself. BTW, in case people didn't know, this, and Joris' code, are exactly the code in base::rle with the addition of the "+incr" offset to the formula for y. – Carl Witthoft Aug 17 '11 at 16:54

score 0 · Answer 6 · answered Aug 16 '11 at 14:22

0

"My use case is that I am working with survey data in an SPSS export file. Each subquestion in a grid of questions will have a name of the pattern paste("q", 1:5), but sometimes there is an "other" category which will be marked q_99, q_other or something else. I wish to find a way of identifying the sequences."

I usually do something like this when I'm pulling data from confirmit, DASH, SPSS, SAS, MySQL or wherever; regardless of the source, it always gets punted into a data.frame():

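surv.pull <- function(dat, pattern) {
  # drop = FALSE keeps a data.frame (and the column name) even for a single match
  dat <- data.frame(dat[, grep(pattern, colnames(dat)), drop = FALSE], check.names = FALSE)
  return(dat)
}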

With a pattern like [q][_][9][9] you can decide whether or not to pull a data.frame of the other data spaces by adding "." to the end ([q][_][9][9].), so that it pulls q_99whatever.

Most of my data columns are in the form like this q8a.1, .3, .4, .5, .6, .7, .8, ... so surv.pull(dat, "[q][8][a].") would pull them all, including the other if there was a specify. Obviously, using regex you could decide whether or not to pull the other.

Alternatively, the general convention is to push other specify questions to the end of the data space, so a quick df <- df[-ncol(df)] would drop it or other_list <- df[ncol(df)] would save it.
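
A small sketch of how these patterns behave on a toy data frame. The column names follow the q_1, q_99, q_99_other style mentioned in the question, the data are made up, and surv_pull is a hypothetical variant of the function above that uses drop = FALSE (an assumption made here so that a single matching column keeps its name).

```r
# Made-up survey extract for illustration only.
dat <- data.frame(q_1 = 1:2, q_2 = 3:4, q_99 = c(0, 1),
                  q_99_other = c("", "text"), check.names = FALSE)

# Hypothetical variant of surv.pull; drop = FALSE keeps a data.frame.
surv_pull <- function(dat, pattern) {
  dat[, grep(pattern, colnames(dat)), drop = FALSE]
}

both       <- surv_pull(dat, "[q][_][9][9]")    # q_99 and q_99_other
only_other <- surv_pull(dat, "[q][_][9][9].")   # needs a character after 99

colnames(both)         # "q_99" "q_99_other"
colnames(only_other)   # "q_99_other"

# Dropping or saving the last column, as described above.
without_last <- dat[-ncol(dat)]
other_list   <- dat[ncol(dat)]
```

The trailing "." requires at least one more character after 99, which is why the bare q_99 column is no longer matched by the second pattern.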

answered Aug 16 '11 at 14:22

Brandon Bertelsen


I am familiar with this design pattern, and use it myself. This is not the issue. What I want to achieve is to separate the initial sequence from any "other" columns at the end. So, the columns names could be `c("q_1", "q_2", "q_3", "q_99", "q_99_other")`. My question tries to find a way of separating the initial sequence of 1:3 from the 99. A grep pattern can't easily do that. – Andrie Aug 16 '11 at 14:49
You could always replace 99 with another identifier in the colnames. I usually do that right at the beginning and push them into a verb file for later review. `.[9][9].` (99 is sometimes an other specify column for me). Unless 99 is an actual question number for you? (Omnibus?) – Brandon Bertelsen Aug 16 '11 at 15:10
Yes, but what if it is "q_98" or anything else. The general pattern is that there is probably a sequence at the start (sometimes an interrupted sequence), and often a motley collection of "other" questions at the end, where "other" is coded differently depending on the panel/omnibus/fieldwork supplier. I have found that 98 or 99 is often a boolean/numeric indicating that this option was ticked, with "q_99_other" containing the text. – Andrie Aug 16 '11 at 15:15
`names(df)[!sapply(df, is.numeric)]` would give you a listing of all the verbatim columns. So you could probably use that to avoid dealing with the sequencing entirely. – Brandon Bertelsen Aug 19 '11 at 08:11
