4

What I am trying to do seems simple- but after 2 days of searching I have decided to post my first question here to see if anyone can help.

I have a dataframe(df) of 5 variables and 250,000 rows. Sample:

            date.time       Lat      Lon    Depth   ms
 1: 2015-11-23 01:14:00 -3.230916 135.0655 100.5  0.391
 2: 2015-11-23 03:05:00 -3.231362 135.0650 300.5  0.225
 3: 2015-11-23 03:22:00 -3.231431 135.0649 500.5  0.091
 4: 2015-11-23 10:51:00 -3.233221 135.0632 400.5  0.0916
 5: 2015-11-23 10:52:00 -3.233225 135.0632 300.5  0.0333
 6: 2015-11-23 11:32:00 -3.233383 135.0630 100.5  0.3833
 7: 2015-11-23 11:33:00 -3.233387 135.0630 200.0 -0.0750
 8: 2015-11-23 12:14:00 -3.233549 135.0629 220.0  0.3166
 9: 2015-11-23 12:15:00 -3.233553 135.0629 300.5  0.0083
10: 2015-11-23 12:39:00 -3.233647 135.0628 500.5  0.3000
11: 2016-10-15 00:37:30 -3.349524 135.0997 550.5 -0.0083
12: 2016-10-15 00:38:30 -3.349537 135.0997 600.0 -0.0583
13: 2016-10-15 00:39:30 -3.349550 135.0998 400.5  0.0583
14: 2016-10-15 00:39:30 -3.349550 135.0998 400.5  0.0583
15: 2016-10-15 00:39:30 -3.349550 135.0998 600.5  0.0583

I want to select the previous n rows (determined by whether it is in increasing order; i.e 100, 200, 300, 400, 500, 600 NOT 100, 200, 400, 100, 50) before Depth values > 500m of the largest value above 500m (to avoid repeats of the same data). I want each of those rows in their entirety to appear in a new dataframe (newdf) :

             date.time       Lat      Lon    Depth   ms
 1: 2015-11-23 01:14:00 -3.230916 135.0655 100.5  0.391
 2: 2015-11-23 03:05:00 -3.231362 135.0650 300.5  0.225
 **3: 2015-11-23 03:22:00 -3.231431 135.0649 500.5  0.091**
 6: 2015-11-23 11:32:00 -3.233383 135.0630 100.5  0.3833
 7: 2015-11-23 11:33:00 -3.233387 135.0630 200.0 -0.0750
 8: 2015-11-23 12:14:00 -3.233549 135.0629 220.0  0.3166
 9: 2015-11-23 12:15:00 -3.233553 135.0629 300.5  0.0083
10: 2015-11-23 12:39:00 -3.233647 135.0628 500.5  0.3000
11: 2016-10-15 00:37:30 -3.349524 135.0997 550.5 -0.0083
**12: 2016-10-15 00:38:30 -3.349537 135.0997 600.0 -0.0583**
14: 2016-10-15 00:39:30 -3.349550 135.0998 400.5  0.0583
**15: 2016-10-15 00:39:30 -3.349550 135.0998 600.5  0.0583**

I have tried the following code:

which_max <- which(df$Depth >= 500)
encoding <- rle(diff(df$Depth) > 0) 

# these contain the start/end indices of all continuously increasing/decreasing subsets
ends <- cumsum(encoding$lengths) + 1L
starts <- ends - encoding$lengths

# filter out the decreasing subsets
starts <- starts[encoding$values]
ends <- ends[encoding$values]

# find the one that contains the maximum
interval <- which(starts <= which_max & ends >= which_max)
out <- df[starts[interval]:ends[interval],] #picks only selected interval to print

based on a previous stack post (Subset only continuously increasing values to max value), but can only get one set of the highest values from my dataset printed instead of each from the original (df):

            date.time      Lat      Lon     Depth   ms
 1: 2016-05-11 23:44:30 1.769763 136.6246  102.0 0.600
 2: 2016-05-11 23:53:30 1.773071 136.6247  108.0 0.7250
 3: 2016-05-11 23:54:30 1.773439 136.6247  193.0 1.4166
 4: 2016-05-11 23:55:30 1.773806 136.6248  281.5 1.475
 5: 2016-05-11 23:56:30 1.774174 136.6248  364.5 1.383
 6: 2016-05-11 23:57:30 1.774542 136.6248  447.0 1.3750
 7: 2016-05-11 23:58:30 1.774910 136.6248  528.0 1.350
 8: 2016-05-11 23:59:30 1.775278 136.6248  609.5 1.358
 9: 2016-05-12 00:00:30 1.775646 136.6248  690.0 1.3416
10: 2016-05-12 00:01:30 1.776013 136.6249  770.0 1.33333

I am assuming I need to employ some type of loop(?) but am relatively new to coding and unsure how to go about it.

EDIT: I have also tried using "lag" but it does not solve needing multiple increasing rows or not double back counting for multiple rows of >500 m (ie. 500, 550, 600, 700...)

I have also used:

df$selecteddepth <- df$Depth * (c(0, diff(df$Depth)) >= 10)

Which selects depths greater than the difference of 10 (which means they are always increasing) but does not address selecting depths of over 500m or removing duplicates

Here is a subset using dput()

structure(list(date.time = structure(c(1450574990, 1450575050, 
1450575110, 1450575170, 1450575230, 1450575290, 1450575350, 1450575410, 
1450575470, 1450575530, 1450575590, 1450575650, 1450575710, 1450575770, 
1450575830, 1450575890), class = c("POSIXct", "POSIXt"), tzone = "UTC"), 
    Lat = c(-3.24669178745284, -3.24667124000555, -3.24665068714376, 
    -3.24663012886971, -3.24660956518562, -3.24658899609375, 
    -3.24656842159633, -3.24654784169558, -3.24652725639375, 
    -3.24650666569307, -3.24648606959577, -3.24646546810409, 
    -3.24644486122025, -3.24642424894649, -3.24640363128504, 
    -3.24638300823813), Lon = c(135.085169407522, 135.085165930176, 
    135.085162450626, 135.085158968873, 135.085155484919, 135.085151998764, 
    135.085148510411, 135.085145019861, 135.085141527116, 135.085138032177, 
    135.085134535045, 135.085131035722, 135.08512753421, 135.08512403051, 
    135.085120524624, 135.085117016552), Depth = c(373, 453, 
    500, 515.5, 521, 526.5, 512, 517.5, 522.5, 504, 522.5, 508.5, 
    481.5, 480, 474, 453), ms = c(1.60833333333333, 1.33333333333333, 
    0.783333333333333, 0.258333333333333, 0.0916666666666667, 
    0.0916666666666667, -0.241666666666667, 0.0916666666666667, 
    0.0833333333333333, -0.308333333333333, 0.308333333333333, 
    -0.233333333333333, -0.45, -0.025, -0.1, -0.35)), row.names = c(NA, 
-16L), class = c("data.table", "data.frame"))

EDIT20-8-21 for det

Current output:

Current Output using dets script

As you can see, the output is in descending depth(970 > 929.5 > 888> 851.5... where the ms are positive) I need the same idea shown above, but in ascending order so it would look something like this (made up data: 500 > 545 > 600 > 700) and the ms should be negative (most of the time) as the animal is diving (negative velocity). So I need the top number that is returned in depth to be smaller than the following numbers. I hope this clarifies it!

Meg.abytes
  • 169
  • 8
  • how about using "lag" ? – jhk0530 Aug 19 '21 at 03:45
  • I have tried using lag as an option, but it only selects one row instead of the amount of rows before >500 depth and doesn't account for increasing numbers only or multiple values of >500 together (i.e. 500, 600, 700, 800...) – Meg.abytes Aug 19 '21 at 03:52
  • Could you share your dataset? It helps by sharing a [reproducible example](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) – Desmond Aug 19 '21 at 03:54
  • The `dput` data is different than what you have shown. What is the expected output for the data shared? – Ronak Shah Aug 19 '21 at 05:55
  • The dput is different because I wanted to select rows where there was good representation of all of the options (depths above 500 and increasing and some decreasing values). The data that I entered here initially was just some data I pulled and motified to display what I wanted) Sorry about the confusion – Meg.abytes Aug 19 '21 at 06:12

1 Answers1

1

This needs to be done iteratively because value is changing. Idea is to find first position on which is greater value and then according to that position find rows that will be in that group and update needed parameters.

cur_start <- 1
cur_value <- 500L
x <- df$Depth
l <- list()
i <- 1

repeat{
  
  if(cur_start > length(x)) break
  
  first_greater <- which(x[cur_start:length(x)] > cur_value)[1]
  
  if(is.na(first_greater)){
    
    break
    
  } else if(first_greater == 1){
    
    cur_start <- cur_start + 1
    next
  }
  
  pos_greater <- cur_start - 1 + first_greater
  cur_value <- x[[pos_greater]]
  
  res <- diff(x[pos_greater:cur_start]) < 0
  
  if(all(res)){
    
    l[[i]] <- cur_start:(pos_greater - 1)
    
  } else {
    
    l[[i]] <-  rev(pos_greater - seq_len(which.min(res) - 1))
  }

  cur_start <- pos_greater + 1
  i <- i + 1
}

lapply(l, function(x) df[x,])
det
  • 5,013
  • 1
  • 8
  • 16
  • Thanks, Det for your answer. This is very close to what I need but the output is the depth values that are in descending order instead of ascending order (i.e. current output: 700, 600, 500, 400... desired output: 400, 500, 600, 700). This is important based on the analysis I am attempting. Also, I am unsure how to convert the output to a new dataframe as the list is unlike any I have worked with previously. – Meg.abytes Aug 20 '21 at 00:51
  • I'm not sure what output you desire, something like: `Reduce(rbind, lapply(l, function(x) df[x,]))` instead of `lapply(l, function(x) df[x,])`? – det Aug 20 '21 at 05:27
  • Hi det, I added edits to my post above to clarify – Meg.abytes Aug 20 '21 at 06:48
  • You are giving examples on some dataset that you didn't provide so it's hard to pinpoint problems. I think it should be fine now but there is no way I can test it because I dont have the right dataset. – det Aug 20 '21 at 07:27
  • Thank you det! This is exactly what I needed. Apologies about not including the entire dataset (I wasn't sure how to do it because it was so large). – Meg.abytes Aug 24 '21 at 01:32
  • Glad it worked. You don't need to include entire dataset (that is usually bad idea) but only minimal representative subset that includes problems you are trying to solve. – det Aug 24 '21 at 05:40