3

I am playing around with binary data.

I have test data in columns in the following manner:

       A   B   C   D   E   F   G   H   I   J   K   L   M   N
       -----------------------------------------------------
       1   1   1   1   1   1   1   1   1   0   0   0   0   0
       0   0   0   0   1   1   1   0   1   1   0   0   1   0

1 is indicating that the system was on and 0 indicating that the system was off.

I have a way to figure out a way to summarize the gaps between the on/off transition of these systems.

For example:

  • for the first row, it stops working after I

  • for the second row, it works from E to G and then works again in I and M but is off during others.

I see my result in the following form (table1)

    row-number   value      grp_num     num       Range
    ------------  -----     --------    ------    ------ 
    1              1           1          9         A-I
    1              0           2          5         J-N
    2              0           1          4         A-D
    2              1           2          3         E-G
    2              0           3          1         H-H
    2              1           4          2         I-J
    2              0           5          2         K-L
    2              1           6          1         M-M
    2              0           7          1         N-N

The code I used is this:

table1 <- test[,-c(1)] %>% 
  rownames_to_column() %>%
  gather(col,val,-rowname) %>%
  group_by(rowname) %>%
  mutate(grp_num = cumsum(val != lag(val, default = -99))) %>%
  group_by(rowname,val,grp_num) %>%
  dplyr::summarise(num = n(),
                   range = paste0(first(col), "-", last(col)))

My question here is if my data had blank entries, how can I exclude them from being a part of a group.

A   B   C   D   E   F   G   H   I   J   K   L   M   N
-----------------------------------------------------
    1   1   1   1   1   1   1   1   0   0   0   0   0
                1   1   1   0   1   1   0   0   1   0

The expected result is very similar but excluding the blank values

         row-number   value      grp_num     num       Range
        ------------  -----     --------    ------    ------ 
        1              1           1          8         B-I
        1              0           2          5         J-N
        2              1           1          3         E-G
        2              0           2          1         H-H
        2              1           3          2         I-J
        2              0           4          2         K-L
        2              1           5          1         M-M
        2              0           6          1         N-N

M--
  • 25,431
  • 8
  • 61
  • 93
Jeet
  • 188
  • 12
  • 1
    Why `I-N`? `I` is on but you are including it in the off time. In the second row though, you have `H-H` and not including previous columns in your off time. Your output is inconsistent. – M-- Oct 25 '19 at 21:52
  • @M-- changes made. Thanks – Jeet Oct 25 '19 at 22:05
  • You should provide a [reproducible example](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example). I am not sure how exactly you have blanks in your data. If you share `dput(head(df, 2))` with us as an [edit] to your question, I will try to see if I can be any of help. p.s. Sorry for delayed response to your update and not including these points in my first comment. Cheers. – M-- Oct 27 '19 at 18:41

2 Answers2

0
# create the test table
df2 <- read.table(text=
"A   B   C   D   E   F   G   H   I   J   K   L   M   N
NA    1   1   1   1   1   1   1   1   0   0   0   0   0
NA   NA   NA  NA   1   1   1   0   1   1   0   0   1   0
", header=TRUE)

find_range <- function(names, values) {
  out = list(
    "0" = c(), 
    "1" = c(),
    "NA" = c() # numbers and NAs can't be the keys of a list, so we need a workaround
  )
  if (is.na(values[1])) {
    current_value = "NA"
  } else {
    current_value = as.character(values[1])
  }
  earliest_name = names[1]
  for (i in 1:length(values)) {
    if (is.na(values[i])) {
      if (current_value != "NA") {
        out[[current_value]] = c(out[[current_value]], paste0(earliest_name, "-", names[i-1]))
        current_value = "NA"
        earliest_name = names[i]
      }
    } else if (as.character(values[i]) != current_value) {
      out[[current_value]] = c(out[[current_value]], paste0(earliest_name, "-", names[i-1]))
      current_value = as.character(values[i])
      earliest_name = names[i]
    }
  }
  out[[current_value]] = c(out[[current_value]], paste0(earliest_name, "-", names[length(names)]))

  return(out)
}


df2 %>% 
  rownames_to_column() %>%
  pivot_longer(-rowname) %>%
  group_by(rowname) %>%
  mutate(range = list(find_range(name, value))) %>%
  select(-name, -value) %>%
  distinct() %>%
  unnest_longer(range) %>%
  unnest(range) %>%
  separate(range, into = c("start", "end"), sep = "-") %>%
  mutate(grp_num = row_number(), 
        num = match(end, LETTERS) - match(start, LETTERS) + 1, 
        Range = paste0(start, "-", end)) %>%
  select(`row-number` = rowname,
  value = range_id,
  grp_num, 
  num, 
  Range)

   `row-number` value grp_num   num Range
   <chr>        <chr>   <int> <dbl> <chr>
 1 1            0           1     5 J-N  
 2 1            1           2     8 B-I  
 3 1            NA          3     1 A-A  
 4 2            0           1     1 H-H  
 5 2            0           2     2 K-L  
 6 2            0           3     1 N-N  
 7 2            1           4     3 E-G  
 8 2            1           5     2 I-J  
 9 2            1           6     1 M-M  
10 2            NA          7     4 A-D  
# you can filter out the NAs if you want, but I think they're neat
Mark
  • 7,785
  • 2
  • 14
  • 34
0

You can try the code below

library(dplyr)
library(data.table)

df %>%
    mutate(rownum = row_number(), .before = 1) %>%
    pivot_longer(cols = -rownum, names_to = "Range") %>%
    na.omit() %>%
    mutate(grpnum = rleid(value), .by = rownum) %>%
    summarise(
        num = n(),
        Range = str_c(Range[c(1, n())], collapse = "-"),
        .by = c(rownum, value, grpnum)
    )

which gives

# A tibble: 8 × 5
  rownum value grpnum   num Range
   <int> <int>  <int> <int> <chr>
1      1     1      1     8 B-I
2      1     0      2     5 J-N
3      2     1      1     3 E-G
4      2     0      2     1 H-H
5      2     1      3     2 I-J
6      2     0      4     2 K-L
7      2     1      5     1 M-M
8      2     0      6     1 N-N

Data

> dput(df)
structure(list(A = c(NA, NA), B = c(1L, NA), C = c(1L, NA), D = c(1L,
NA), E = c(1L, 1L), F = c(1L, 1L), G = c(1L, 1L), H = 1:0, I = c(1L,
1L), J = 0:1, K = c(0L, 0L), L = c(0L, 0L), M = 0:1, N = c(0L,
0L)), class = "data.frame", row.names = c(NA, -2L))
ThomasIsCoding
  • 96,636
  • 9
  • 24
  • 81