1

I have the dataframe (sample) below:

companyID   year   yearID
    1       2010     1
    1       2011     2
    1       2012     3
    1       2013     4
    2       2010     1
    2       2011     2
    2       2016     3
    2       2017     4
    2       2018     5
    3       2010     1
    3       2011     2
    3       2014     3
    3       2017     4
    3       2018     5

I have used a for loop in order to try and create a sequence column that starts a new number for each new sequence of numbers. I am new to R so my definitions may be a bit wrong. My for loop looks like this:

size1 <- c(1:3)
s <- 0
for (val1 in size) {
  m <- max(sample[sample$companyID == val1, 4])
  size2 <- c(1:m)
  for (val2 in size2){ 
    row <- sample[which(sample$companyID == val1 & sample$yearID == val2)]
    m1 <- sample[sample$companyID == val1 & sample$yearID == val2, 2]
    m2 <- sample[sample$CompanyID == val1 & sample$yearID == (val2-1), 2]
    if(val2>1 && m1-m2 > 1) {
                  sample$sequence[row] s = s+1}
    else {s = s}
  }
  }

Where m is the max value of the yearID per companyID, row is to identify that the value should be entered on the row where companyID = val1 and yearID = val2, m1 is from the year variable and is the latter year, whereas m2 is the former year. What I have tried to do is to change the sequence every time m1-m2 > 1 (when val2 > 1 also).

Desired outcome:

companyID   year   yearID   sequence
    1       2010     1          1
    1       2011     2          1
    1       2012     3          1
    1       2013     4          1
    2       2010     1          2
    2       2011     2          2
    2       2016     3          3
    2       2017     4          3
    2       2018     5          3
    3       2010     1          4
    3       2011     2          4
    3       2014     3          5
    3       2017     4          6
    3       2018     5          6

Super appreciative if anyone can help!!

2 Answers2

2

This is a good question!

  1. First group_by companyID
  2. calculate the difference of each consecutive row in year column with lag to identify if year is consecutive.
  3. group_by companyID, yearID)
  4. mutate helper column sequence1 to apply 1 to each starting consecutive year in group.
  5. ungroup and apply a sequence number eachtime 1 occurs in sequence1
  6. remove column sequence1 and deltalag1
library(tidyverse)

df1 <- df %>% 
  group_by(companyID) %>% 
  mutate(deltaLag1 = year - lag(year, 1)) %>% 
  group_by(companyID, yearID) %>% 
  mutate(sequence1 = case_when(is.na(deltaLag1) | deltaLag1 > 1 ~ 1,
                               TRUE ~ 2)) %>% 
  ungroup() %>% 
  mutate(sequence = cumsum(sequence1==1)) %>% 
  select(-deltaLag1, -sequence1)

data

df <- tribble(
~companyID,   ~year,   ~yearID,
1, 2010, 1, 
1, 2011, 2, 
1, 2012, 3, 
1, 2013, 4, 
2, 2010, 1, 
2, 2011, 2, 
2, 2016, 3, 
2, 2017, 4, 
2, 2018, 5, 
3, 2010, 1, 
3, 2011, 2, 
3, 2014, 3, 
3, 2017, 4, 
3, 2018, 5)

enter image description here

TarJae
  • 72,363
  • 6
  • 19
  • 66
0

It's not clear if you want the exact desired outcome or check that you have consecutive years by companyID.

According to your title message:

sample <- read.table(header = TRUE, text = "
companyID   year   yearID
    1       2010     1
    1       2011     2
    1       2012     3
    1       2013     4
    2       2010     1
    2       2011     2
    2       2016     3
    2       2017     4
    2       2018     5
    3       2010     1
    3       2011     2
    3       2014     3
    3       2017     4
    3       2018     5
")

library(data.table)
sample <- setDT(sample)
sample[ , diff_year := year - shift(year), by = companyID]    
sample <- setDF(sample)
sample
#>    companyID year yearID diff_year
#> 1          1 2010      1        NA
#> 2          1 2011      2         1
#> 3          1 2012      3         1
#> 4          1 2013      4         1
#> 5          2 2010      1        NA
#> 6          2 2011      2         1
#> 7          2 2016      3         5
#> 8          2 2017      4         1
#> 9          2 2018      5         1
#> 10         3 2010      1        NA
#> 11         3 2011      2         1
#> 12         3 2014      3         3
#> 13         3 2017      4         3
#> 14         3 2018      5         1

# Created on 2021-03-13 by the reprex package (v1.0.0.9002)

Related to Calculate difference between values in consecutive rows by group

Regards,

barboulotte
  • 395
  • 2
  • 8