0

I have a df (“df”) containing multiple time series (value ~ time) whose observations are grouped by 3 factors: temp, rep, and species. These data need to be trimmed at the lower and upper ends of the time series, but these threshold values are group conditional (e.g. remove observations below 2 and above 10 where temp=10, rep=2, and species = “A”). I have an accompanying df (df_thresholds) that contains grouping values and the mins and maxs i want to use for each group. Not all groups need trimming (I would like to update this file regularly which would guide where to trim df). Can anybody help me conditionally filter out these values by group? I have the following, which is close but not quite there. When I reverse the max and min boolean tests, I get zero observations.

df <- data.frame(species = c(rep("A", 16), rep("B", 16)),
                 temp=as.factor(c(rep(10,4),rep(20,4),rep(10,4),rep(20,4))),
                 rep=as.factor(c(rep(1,8),rep(2,8),rep(1,8),rep(2,8))),
                 time=rep(seq(1:4),4),
                 value=c(1,4,8,16,2,4,9,16,2,4,10,16,2,4,15,16,2,4,6,16,1,4,8,16,1,2,8,16,2,3,4,16))

df_thresholds <- data.frame(species=c("A", "A", "B"), 
                            temp=as.factor(c(10,20,10)),
                            rep=as.factor(c(1,1,2)), 
                            min_value=c(2,4,2),
                            max_value=c(10,10,9))

#desired outcome
df_desired <- df[c(2:3,6:7,9:24,26:27,29:nrow(df)),]


#attempt
df2 <- df

for (i in 1:nrow(df_thresholds)) {  
  df2 <- df2 %>%
    filter(!(species==df_thresholds$species[i] & temp==df_thresholds$temp[i] & rep==df_thresholds$rep[i] & value>df_thresholds$min_value[i] & value<df_thresholds$max_value[i]))
}

EDIT: Here's the solution I implemented per suggestions below.

df_test <- left_join(df, df_thresholds, by=c('species','temp','rep'))
df_test$min_value[is.na(df_test$min_value)] <- 0
df_test$max_value[is.na(df_test$max_value)] <- 999

df_test2 <- df_test %>%
  filter(value >= min_value & value <= max_value)
bishopia
  • 13
  • 7
  • I'm not sure what the expected result should be but an option might be join the two df with a `left_join` and the apply your filter to the resulting df `df_test <- left_join(df, df_thresholds, by=c("species", 'temp', 'rep')) %>% filter(value > min_value & value < max_value )` – NColl Dec 11 '18 at 02:29
  • you could use `data.table::foverlaps`, with this function you can (fast!) perform overlapping joins... read the answers in this topic: https://stackoverflow.com/questions/24480031/roll-join-with-start-end-window. – Wimpel Dec 11 '18 at 02:32
  • i added the desired outcome, but i wanted to point out that i would like to subset by constraining the time variable, not the values. – bishopia Dec 11 '18 at 12:57
  • 1
    @NColl, thank you very much. by left joining my limits to df and filter against the value variable i excluded everything i wanted to. – bishopia Dec 11 '18 at 13:26

1 Answers1

0

We can find out indices which we want to exclude using mapply

df[-c(with(df_thresholds, 
      mapply(function(x, y, z, min_x, max_x) 
           which(df$species == x & df$temp == y & df$rep == z & 
              (df$value < min_x | df$value > max_x)),
                 species, temp, rep, min_value, max_value))), ]


#   species temp rep time value
#2        A   10   1    2     4
#3        A   10   1    3     8
#6        A   20   1    2     4
#7        A   20   1    3     9
#9        A   10   2    1     2
#10       A   10   2    2     4
#11       A   10   2    3    10
#12       A   10   2    4    16
#......

In mapply we pass all the columns of df_thresholds filter df accordingly and find out indices which are outside min and max value for each row and exclude them from the original dataframe.

The result of mapply call is

#[1]  1  4  5  8 25 28

which are the rows we want to exclude from the df since they fall out of range.

Ronak Shah
  • 377,200
  • 20
  • 156
  • 213
  • i'm getting the output you're posting, but it's not working that well with my actual dataset. do my grouping variables need to be factors, or can they be numeric, integer or character? also, if the number of observations per group varies, would that affect your answer? – bishopia Dec 11 '18 at 13:01
  • @bishopia no, grouping variables could be anything, it should still work. Can you tell me what is not working in your actual dataset? Is it giving you more rows? or less rows or is it giving you any error? – Ronak Shah Dec 11 '18 at 13:06
  • "Error in -c(with(yy, mapply(function(x, y, z, min_x, max_x) which(xx$ref == : invalid argument to unary operator". i changed "species" to "ref" – bishopia Dec 11 '18 at 13:09
  • @bishopia can you remove `df[-c(` part and run the code ? Do you get row indices as shown for above example ? – Ronak Shah Dec 11 '18 at 13:16