Group rows in data frame based on time difference between consecutive rows

Question

I have a data frame of this type

YEAR   MONTH  DAY  HOUR     LON      LAT
1860   10     3    13     -19.50     3.00
1860   10     3    17     -19.50     4.00
1860   10     3    21     -19.50     5.00
1860   10     5    5      -20.50     6.00
1860   10     5    13     -21.50     7.00
1860   10     5    17     -21.50     8.00
1860   10     6    1      -22.50     9.00
1860   10     6    5      -22.50    10.00
1860   12     5    9      -22.50    -7.00
1860   12     5    18     -23.50    -8.00
1860   12     5    22     -23.50    -9.00
1860   12     6    6      -24.50   -10.00
1860   12     6    10     -24.50   -11.00
1860   12     6    18     -24.50   -12.00

What I would like to do is calculate the interpolating line for every subset of temporally close points (e.g. the temporal difference between consecutive points is less than 4 days; in the example above there are 2 subsets: one from 1860-10-3 to 1860-10-6 and the other from 1860-12-5 to 1860-12-6), and then create an extra column holding the fit correlation coefficient associated with the respective subset's interpolating line.

The problem is that I don't know how to subset my data frame properly according to the criteria stated above.

Please show the expected output. – Sven Hohenstein Dec 12 '13 at 15:21

score 12 · Accepted Answer · edited May 23 '17 at 10:29

Here is another possibility which groups rows where the time difference between consecutive rows is less than 4 days.

# create date variable
df$date <- with(df, as.Date(paste(YEAR, MONTH, DAY, sep = "-")))

# calculate successive differences between dates
# and flag gaps of 4 days or more (differences below 4 days stay in the same group)
df$gap <- c(0, diff(df$date) >= 4)

# cumulative sum of 'gap' variable
df$group <- cumsum(df$gap) + 1

df    
#    YEAR MONTH DAY HOUR   LON LAT       date gap group
# 1  1860    10   3   13 -19.5   3 1860-10-03   0     1
# 2  1860    10   3   17 -19.5   4 1860-10-03   0     1
# 3  1860    10   3   21 -19.5   5 1860-10-03   0     1
# 4  1860    10   5    5 -20.5   6 1860-10-05   0     1
# 5  1860    10   5   13 -21.5   7 1860-10-05   0     1
# 6  1860    10   5   17 -21.5   8 1860-10-05   0     1
# 7  1860    10   6    1 -22.5   9 1860-10-06   0     1
# 8  1860    10   6    5 -22.5  10 1860-10-06   0     1
# 9  1860    12   5    9 -22.5  -7 1860-12-05   1     2
# 10 1860    12   5   18 -23.5  -8 1860-12-05   0     2
# 11 1860    12   5   22 -23.5  -9 1860-12-05   0     2
# 12 1860    12   6    6 -24.5 -10 1860-12-06   0     2
# 13 1860    12   6   10 -24.5 -11 1860-12-06   0     2
# 14 1860    12   6   18 -24.5 -12 1860-12-06   0     2

Disclaimer: the diff & cumsum part is inspired by this Q&A: How to partition a vector into groups of regular, consecutive sequences?.
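
Once the `group` column exists, the per-subset fit statistic the question asks for can be attached with base R. The following is a minimal sketch and not part of the original answer: it assumes the line is a regression of `LAT` on `LON`, takes the Pearson correlation from `cor()` as the "fit correlation coefficient", and uses a hypothetical column name `r`. The `>= 4` threshold is also an assumption, read off the question's "less than 4 days".

```r
# Sample data copied from the question
df <- data.frame(
  YEAR  = 1860,
  MONTH = rep(c(10, 12), times = c(8, 6)),
  DAY   = c(3, 3, 3, 5, 5, 5, 6, 6, 5, 5, 5, 6, 6, 6),
  HOUR  = c(13, 17, 21, 5, 13, 17, 1, 5, 9, 18, 22, 6, 10, 18),
  LON   = c(-19.5, -19.5, -19.5, -20.5, -21.5, -21.5, -22.5, -22.5,
            -22.5, -23.5, -23.5, -24.5, -24.5, -24.5),
  LAT   = c(3:10, -7:-12)
)

# Grouping as in the answer: a gap of 4 or more days starts a new group
# (assumed threshold, see the note above)
df$date  <- as.Date(paste(df$YEAR, df$MONTH, df$DAY, sep = "-"))
df$group <- cumsum(c(FALSE, diff(df$date) >= 4)) + 1

# Pearson correlation between LON and LAT within each group
r_by_group <- sapply(split(df, df$group), function(g) cor(g$LON, g$LAT))

# Copy each group's coefficient back onto its rows (hypothetical column 'r')
df$r <- unname(r_by_group[as.character(df$group)])

# Optional: the fitted line itself, one linear model per group
fits <- lapply(split(df, df$group), function(g) lm(LAT ~ LON, data = g))
```

If the slope and intercept are needed as well, `coef(fits[["1"]])` returns them for the first group.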

How can I represent in the `group` column the first value of the `DAY` column, instead of the values 1, 2, ...? — Wilson Souza, Aug 05 '22 at 12:31
@Wilson Souza Once you have created the grouping variable, see e.g. [Calculate group mean, sum, or other summary stats. and assign column to original data](https://stackoverflow.com/questions/6053620/calculate-group-mean-sum-or-other-summary-stats-and-assign-column-to-original), where your calculation is "first value" or `min`. — Henrik, Aug 05 '22 at 12:40
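
A minimal sketch of what that comment points to, using base R's `ave()` to broadcast the first `DAY` of each group onto all of its rows. The toy data frame below only stands in for the one built in the answer, and the column name `first_day` is hypothetical.

```r
# Toy columns standing in for the data frame built in the answer
df <- data.frame(
  DAY   = c(3, 3, 3, 5, 5, 5, 6, 6, 5, 5, 5, 6, 6, 6),
  group = rep(1:2, times = c(8, 6))
)

# ave() applies FUN within each group and returns a vector aligned with the rows
df$first_day <- ave(df$DAY, df$group, FUN = function(x) x[1])

# first_day is 3 for the eight rows of group 1 and 5 for the six rows of group 2
```

Note that `DAY` alone does not identify a group uniquely (two groups in different months can start on the same day of the month), so labelling by the first full date may be the safer choice.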

score 0 · Answer 2 · edited Dec 12 '13 at 14:49


I would try something along these lines. Since you mention that you only need to figure out the subsetting logic, I haven't bothered to add the correlation coeff calculation.

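# build a Date column from the year, month and day columns
df$date <- as.Date(paste(df$YEAR, df$MONTH, df$DAY), format = "%Y %m %d")

uniquedates <- unique(df$date)
uniquedatesfourth <- uniquedates + 4

for (i in seq_along(uniquedates)) {
   # rows dated from uniquedates[i] up to (but not including) 4 days later
   tempsubset <- subset(df, date >= uniquedates[i] & date < uniquedatesfourth[i])
   # operations on tempsubset
}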

answered Dec 12 '13 at 14:46 by TheComeOnMan · edited Dec 12 '13 at 14:49 by zx8754

Sorry but I don't understand how to use the tempsubset created: e.g. how can I assign an index to every element of the data frame indicating to what subset it belongs and how can I see the created subsets? Many thanks – user3036416 Dec 12 '13 at 16:57
You could make tempsubset a list, `tempsubset <- vector(mode="list", length = length(uniquedates))`, `tempsubset[[i]] <- subset...` – TheComeOnMan Dec 12 '13 at 17:14
Thanks, but the other solution suited my problem better. – user3036416 Dec 12 '13 at 17:40
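
For readers who want to try the list-based suggestion from the comments, here is a rough sketch. It assumes the window for each start date runs from that date up to, but not including, 4 days later; the `names()` step is an addition for easier inspection and is not in the original comment.

```r
# Date columns copied from the question
df <- data.frame(
  YEAR  = 1860,
  MONTH = rep(c(10, 12), times = c(8, 6)),
  DAY   = c(3, 3, 3, 5, 5, 5, 6, 6, 5, 5, 5, 6, 6, 6)
)
df$date <- as.Date(paste(df$YEAR, df$MONTH, df$DAY), format = "%Y %m %d")

uniquedates <- unique(df$date)

# One list slot per unique start date, as suggested in the comment
tempsubset <- vector(mode = "list", length = length(uniquedates))
for (i in seq_along(uniquedates)) {
  # assumed window: [uniquedates[i], uniquedates[i] + 4 days)
  tempsubset[[i]] <- subset(df, date >= uniquedates[i] & date < uniquedates[i] + 4)
}
names(tempsubset) <- format(uniquedates)

# Number of rows captured by each window
sapply(tempsubset, nrow)
```

The windows overlap (the one starting on 1860-10-03 also contains the rows from 1860-10-05 and 1860-10-06), so a row can land in several subsets. That is why this approach does not give each row a single subset index, whereas the `diff` and `cumsum` grouping in the accepted answer does.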
