3

I am currently trying to select non-consecutive dates for every grouped column.

In other words, I have the below dataframe:

enter image description here

I would like to basically group_by(Site) and then keep only 3 random non-consecutive dates for every grouped Site. For example, if HP37P1B has dates corresponding to 12th March, 13th March, 14th March and 7th March - I need a dataframe (for example) that only has:

HP37P1B 12th March

HP37P1B 14th March

HP37P1B 7th March

So far I have tried a number of stackoverflow posts that use diff(), ave(), and the lubridate package, but I haven't had any success.

EDIT

Based on comments below, I am trying to make this question reproducible

dput(uniqueSiteDate)

structure(list(Site = c("HP37P1B", "HP37P2B", "HP37P4B", "HP4008U", 
"INME03R", "INME03U", "INOA03R", "IPTO04R", "IPTO04U", "IPTO06R", 
"IPTO06U", "OLCAP2B", "OLCAP3B", "OLCAP5B", "PANMP1B", "PANMP2B", 
"PANMP3B", "STIN02R", "STIN02U", "UPMAP1B", "UPMAP3B", "UPMAP4B", 
"UPMAP5B", "UPMAP6B", "VAR210R", "VAR310R", "VAR310U", "VAR410R", 
"VAR410U", "HP36P1B", "HP36P3B", "HP36P4B", "HP4008R", "INBS04R", 
"INBS04U", "SEL107R", "SEL107U", "SEL207R", "SEL207U", "OLV110R", 
"OLV110U", "OLV208R", "OLV208U", "THEN10U", "HP37P1B", "HP37P2B", 
"HP37P4B", "HP4008U", "INME03R", "INME03U", "INOA03R", "IPTO04R", 
"IPTO04U", "IPTO06R", "IPTO06U", "OLCAP2B", "OLCAP3B", "OLCAP5B", 
"PANMP1B", "PANMP2B", "PANMP3B", "STIN02R", "STIN02U", "UPMAP1B", 
"UPMAP3B", "UPMAP4B", "UPMAP5B", "UPMAP6B", "VAR210R", "VAR310R", 
"VAR310U", "VAR410R", "VAR410U", "OLV110R", "OLV110U", "OLV208R", 
"OLV208U", "THEN10U", "HP37P1B", "HP37P2B", "HP37P4B", "HP4008U", 
"INME03R", "INME03U", "INOA03R", "IPTO04R", "IPTO04U", "IPTO06R", 
"IPTO06U", "OLCAP2B", "OLCAP3B", "OLCAP5B", "PANMP1B", "PANMP2B", 
"PANMP3B", "STIN02R", "STIN02U", "UPMAP1B", "UPMAP3B", "UPMAP4B", 
"UPMAP5B", "UPMAP6B", "VAR210R", "VAR310R", "VAR310U", "VAR410R", 
"VAR410U", "OLV110R", "OLV110U", "OLV208R", "OLV208U", "THEN10U", 
"HP37P1B", "HP37P2B", "HP37P4B", "HP4008U", "INME03R", "INME03U", 
"INOA03R", "IPTO04R", "IPTO04U", "IPTO06R", "IPTO06U", "OLCAP2B", 
"OLCAP3B"), Date = structure(c(18333, 18333, 18333, 18333, 18335, 
18335, 18335, 18338, 18335, 18338, 18335, 18333, 18333, 18333, 
18334, 18334, 18334, 18331, 18331, 18331, 18330, 18330, 18330, 
18330, 18332, 18332, 18332, 18332, 18332, 18325, 18325, 18325, 
18325, 18327, 18327, 18327, 18327, 18327, 18328, 18340, 18340, 
18340, 18340, 18340, 18334, 18334, 18334, 18334, 18336, 18336, 
18336, 18339, 18336, 18340, 18336, 18335, 18334, 18334, 18335, 
18335, 18335, 18332, 18332, 18332, 18331, 18331, 18331, 18331, 
18333, 18333, 18333, 18333, 18333, 18341, 18341, 18341, 18341,
18341, 18335, 18335, 18335, 18335, 18383, 18383, 18383, 18384, 
18384, 18384, 18384, 18385, 18385, 18335, 18342, 18342, 18341, 
18383, 18383, 18345, 18349, 18349, 18349, 18349, 18340, 18339, 
18340, 18341, 18339, 18386, 18386, 18348, 18346, 18347, 18328, 
18328, 18328, 18328, 18390, 18389, 18391, 18392, 18392, 18392, 
18392, 18392, 18392), class = "Date")), row.names = c(NA, -125L
), groups = structure(list(Site = c("HP36P1B", "HP36P3B", "HP36P4B", 
"HP37P1B", "HP37P2B", "HP37P4B", "HP4008R", "HP4008U", "INBS04R", 
"INBS04U", "INME03R", "INME03U", "INOA03R", "IPTO04R", "IPTO04U", 
"IPTO06R", "IPTO06U", "OLCAP2B", "OLCAP3B", "OLCAP5B", "OLV110R", 
"OLV110U", "OLV208R", "OLV208U", "PANMP1B", "PANMP2B", "PANMP3B", 
"SEL107R", "SEL107U", "SEL207R", "SEL207U", "STIN02R", "STIN02U", 
"THEN10U", "UPMAP1B", "UPMAP3B", "UPMAP4B", "UPMAP5B", "UPMAP6B", 
"VAR210R", "VAR310R", "VAR310U", "VAR410R", "VAR410U"), .rows = structure(list(
    30L, 31L, 32L, c(1L, 45L, 79L, 113L), c(2L, 46L, 80L, 114L
    ), c(3L, 47L, 81L, 115L), 33L, c(4L, 48L, 82L, 116L), 34L, 
    35L, c(5L, 49L, 83L, 117L), c(6L, 50L, 84L, 118L), c(7L, 
    51L, 85L, 119L), c(8L, 52L, 86L, 120L), c(9L, 53L, 87L, 121L
    ), c(10L, 54L, 88L, 122L), c(11L, 55L, 89L, 123L), c(12L, 
    56L, 90L, 124L), c(13L, 57L, 91L, 125L), c(14L, 58L, 92L), 
    c(40L, 74L, 108L), c(41L, 75L, 109L), c(42L, 76L, 110L), 
    c(43L, 77L, 111L), c(15L, 59L, 93L), c(16L, 60L, 94L), c(17L, 
    61L, 95L), 36L, 37L, 38L, 39L, c(18L, 62L, 96L), c(19L, 63L, 
    97L), c(44L, 78L, 112L), c(20L, 64L, 98L), c(21L, 65L, 99L
    ), c(22L, 66L, 100L), c(23L, 67L, 101L), c(24L, 68L, 102L
    ), c(25L, 69L, 103L), c(26L, 70L, 104L), c(27L, 71L, 105L
    ), c(28L, 72L, 106L), c(29L, 73L, 107L)), ptype = integer(0), class = c("vctrs_list_of", 
"vctrs_vctr", "list"))), row.names = c(NA, -44L), class = c("tbl_df", 
"tbl", "data.frame"), .drop = TRUE), class = c("grouped_df", 
"tbl_df", "tbl", "data.frame"))

To answer other questions, sometimes there are more than 3 dates per site, but sometimes there is just 1 date per site. But the idea is to choose n Number of non-consecutive dates given a Site. In other words, if a particular site has 4 dates, I need 3 non-consecutive ones. If a particular site has only 1 date, let's just leave that in.

pietrodito
  • 1,783
  • 15
  • 24
Vijay Ramesh
  • 191
  • 1
  • 2
  • 20
  • 2
    1. Could you make the example [reproducible](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example)? 2. Are there always at least 3 dates by `Site`? 3. Is the "non consecutive" part a hard requirement? Or are solutions that sample 3 dates at random (thus possibly consecutive) satisfactory? And if not, is it guaranteed that there are not only 3 dates, but 3 dates that make it possible to find a solution? – Aurèle Apr 22 '21 at 09:17
  • Even your picture is not matching with the stated example?? Date in first `site` is same for all three rows. – AnilGoyal Apr 22 '21 at 09:47
  • I added more information to help answer this question @Aurèle – Vijay Ramesh Apr 22 '21 at 09:49
  • @AnoushiravanR See edits above – Vijay Ramesh Apr 22 '21 at 09:50
  • In the data provided, the all dates for first group are same? – AnilGoyal Apr 22 '21 at 09:51
  • @AnilGoyal If you notice clearly, they are different groups - HP37P1B, HP37P2B – Vijay Ramesh Apr 22 '21 at 09:56
  • Ok. But the example is not minimal. Please post `dput` such that it can be toyed around. With six groups, one row each group, how can one make a random sample of three per group?? – AnilGoyal Apr 22 '21 at 10:00
  • @AnilGoyal I have added the dput now. Please see. – Vijay Ramesh Apr 22 '21 at 10:06
  • What if 3 non-conscutive dates cannot be produced from those 4 dates, what then? You say you _need_ 3, so that's an error, right? – Sirius Apr 22 '21 at 10:08
  • In which case, I think we should basically just keep 2. – Vijay Ramesh Apr 22 '21 at 10:25

2 Answers2

1

Please check whether it serves the purpose? Actually, selecting maximum possible dates with the provide criteria is difficult (at least for me). We can identify dates in consecutive and non-consecutive groups by the following strategy. But consider two scenarios from a group of say 3 consecutive dates. If the random sample contains 2 units, these can be consecutive or non-consecutive as well. Suppose if we further select either odd (2) or even(1) rows then the sample would have been judgmental and not random in my opinion. This is the strategy adopted -

  • splitted the data in groups
  • carried out operations in each group separately through purrr::map_df which finally row binds the data
  • divided the data (now groups) in consecutive and non-consecutive dates (each consecutive date will be in its own group). Select unique row from each group.
  • finally select three (or less as per group outcome) from each of these rows.
library(tidyverse)

df %>% 
  ungroup() %>% 
  group_split(Site) %>% 
  map_df(., ~ .x %>% ungroup() %>%
        arrange(Date) %>%
        mutate(n = 1) %>%
        complete(Date = seq.Date(first(Date), last(Date), by = 'days')) %>%
        group_by(n = cumsum(is.na(n))) %>%
        filter(!is.na(Site)) %>%
        sample_n(1) %>%
        ungroup() %>%
        sample_n(min(n(), 3))) %>%
  select(-n)

# A tibble: 86 x 2
   Date       Site   
   <date>     <chr>  
 1 2020-03-04 HP36P1B
 2 2020-03-04 HP36P3B
 3 2020-03-04 HP36P4B
 4 2020-03-07 HP37P1B
 5 2020-03-12 HP37P1B
 6 2020-03-07 HP37P2B
 7 2020-03-12 HP37P2B
 8 2020-03-07 HP37P4B
 9 2020-03-12 HP37P4B
10 2020-03-04 HP4008R
# ... with 76 more rows

Note: Your dput was grouped so I had to add ungroup() in second line of the code, which you may remove

AnilGoyal
  • 25,297
  • 4
  • 27
  • 45
  • Thanks a ton @AnilGoyal. But, for the example of HP37P1B - there can be 3 non-consecutive dates that should be selected, but only two have been selected. For example, we could select 7th March (which you did), 12th March (which you did), but one could also select 14th March. Similarly for some other sites. – Vijay Ramesh Apr 22 '21 at 10:49
  • Yes exactly, but then sample won't be random. It'd the only possible sample then – AnilGoyal Apr 22 '21 at 10:52
  • No, I meant - I wanted to select atleast 3 random non-consecutive ones. In your code, we have selected only 2. – Vijay Ramesh Apr 22 '21 at 10:54
  • In the case of HP37P1B, atleast 3 non-consecutive ones can be selected, while for the case of HP36P1B, only 1 date can be selected. – Vijay Ramesh Apr 22 '21 at 10:55
  • ? Does that make sense? - Basically if I were to ask you to select any possible number of non-consecutive dates (at random), given the sample size. – Vijay Ramesh Apr 22 '21 at 10:57
  • Yes, this makes sense. But you please imagine a scenario or three consecutive dates say 1, 2, and 3. now random sample will either detect 2 or 1&3 both (sample from either even or odd numbered groups). Now you have an additional criteria that sample should maximum possible output i.e. 2 which seems mathematically not possible (considering the randomness) – AnilGoyal Apr 22 '21 at 11:02
  • then in this example you actually want odd rows only and not random sample – AnilGoyal Apr 22 '21 at 11:03
  • Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/231451/discussion-between-anilgoyal-and-vijay-ramesh). – AnilGoyal Apr 22 '21 at 11:04
  • Aah, I see what you mean. Would it possible for you to show me how I can select odd dates then to maximize the number of dates chosen? – Vijay Ramesh Apr 22 '21 at 11:04
0

This should work, but I can't really test this on your data since it contains only one row per site (and thus only one date).


l <- dat %>% group_by( Site ) %>%
    group_map( function(x,y) {

        n.day.increments <- sum( diff(sort(x$Date)) == 1 )
        n.2day.increments <- sum( diff(sort(x$Date)) >= 2 )
        can.have.3.dates <- n.day.increments >= 5 || n.2day.increments >= 2 || (n.day.increments >= 2 && n.2day.increments >= 1 )
        can.have.2.dates <- n.day.increments >= 2 || n.2day.increments >= 1


        repeat{

            if( can.have.3.dates ) {
                d <- x %>% sample_n(3)

                if( all(abs(diff(d$Date)) > 1 ) )
                    break

            } else if( can.have.2.dates ) {

                if( can.have.2.dates ) {
                    d <- x %>% sample_n(2)

                    if( all(abs(diff(d$Date)) > 1 ) )
                        break

                }

            } else {
                d <- x %>% sample_n(1)
                break
            }

        }
        cbind( y, d )
    })

sampled.data <- bind_rows(l)


Sirius
  • 5,224
  • 2
  • 14
  • 21
  • Getting an error right now. Could you see the edits to the question? – Vijay Ramesh Apr 22 '21 at 09:49
  • This should work even if there is only one date associated with a Site right? – Vijay Ramesh Apr 22 '21 at 10:02
  • no, absolutely not, your updated question said that also 2 nonconsecutive dates was ok, so that was my new minimum requirement – Sirius Apr 22 '21 at 10:02
  • I have edited the question further - please see. – Vijay Ramesh Apr 22 '21 at 10:07
  • you need to be super clear about this. What should the code do when it cannot fulfill 3 nonconsecutive dates. select 3 consecituve if possible? go for 2 nonconsectuve instead? what if it cannot produce 2 nonconsecutive but can do 2 consecutive, should it do those 2, or just return 1? – Sirius Apr 22 '21 at 10:07
  • If there are 3 dates and all 3 are non-consecutive, let's keep them - else just choose 2 non-consecutive ones. If there is only 1 date - let's just keep that. If there are 4 dates, I want 3 non-consecutive ones. Does that help? Sorry! – Vijay Ramesh Apr 22 '21 at 10:08
  • what if 4 dates cannot produce 3 non-consecutive dates? what a site contains only 2 dates that are consecutive - should I then select just one or do you still want these 2? We can't guess these things... – Sirius Apr 22 '21 at 10:10
  • Fair enough - if there are 2 dates that are consecutive, we keep only one. If there 4 dates that are consecutive, then we keep just 2 non-consecutive ones. – Vijay Ramesh Apr 22 '21 at 10:25
  • check the strategy proposed. First, the code checks for consecutive date groups using `complete` function thereafter samples. I am adding an explanation. This way however, you cannot have more than one date per group of consecutive dates even when possible. However, I am also thinking of strategy to include more dates from these kinda groups. – AnilGoyal Apr 22 '21 at 10:36
  • It's updated to produce 3 if possible, or produce 2 if possible, otherwise just draw 1 – Sirius Apr 22 '21 at 10:36
  • made some edits, struggled with the checks for 3 days or 2days. @AnilGoyal was that a comment for me? – Sirius Apr 22 '21 at 10:46
  • Updated it again, the algorithm to check for 3 non-consec dates should be revised to somethign more elegant, but I think it will work now – Sirius Apr 22 '21 at 10:52