1

Was hoping someone can assist.

I am trying to add another column: 6th Worst. What I want to do is for it to produce the 6th worst y result based on a specified criteria: Date.

Here is an example of my df:

Key     Date                     y   x1   x2   x3
   1    1/10/2018 12:00:00 AM    2   3    2    5
   1    1/11/2018 12:00:00 AM    3   5    7    2
   1    1/12/2018 12:00:00 AM    5   7    4    7 
   1    1/13/2018 12:00:00 AM    7   2    7    6
   2    1/10/2018 12:00:00 AM    2   6    3    8
   2    1/11/2018 12:00:00 AM    3   7    7    3
   2    1/12/2018 12:00:00 AM    3   2    3    4
   2    1/13/2018 12:00:00 AM    7   6    2    7
   3    1/10/2018 12:00:00 AM    2   3    2    5
   3    1/11/2018 12:00:00 AM    3   5    7    2
   3    1/12/2018 12:00:00 AM    5   7    4    7 
   3    1/13/2018 12:00:00 AM    7   2    7    6
   3    1/10/2018 12:00:00 AM    2   6    3    8
   3    1/11/2018 12:00:00 AM    3   7    7    3
   3    1/12/2018 12:00:00 AM    3   2    3    4
   3    1/13/2018 12:00:00 AM    7   6    2    7
   4    1/10/2018 12:00:00 AM    2   3    2    5
   4    1/11/2018 12:00:00 AM    3   5    7    2
   4    1/12/2018 12:00:00 AM    5   7    4    7 
   4    1/13/2018 12:00:00 AM    7   2    7    6
   4    1/10/2018 12:00:00 AM    2   6    3    8
   4    1/11/2018 12:00:00 AM    3   7    7    3
   5    1/12/2018 12:00:00 AM    3   2    3    4
   5    1/13/2018 12:00:00 AM    7   6    2    7
   5    1/10/2018 12:00:00 AM    2   3    2    5
   5    1/11/2018 12:00:00 AM    3   5    7    2
   5    1/12/2018 12:00:00 AM    5   7    4    7 
   5    1/13/2018 12:00:00 AM    7   2    7    6
   6    1/10/2018 12:00:00 AM    2   6    3    8
   6    1/11/2018 12:00:00 AM    3   7    7    3
   6    1/12/2018 12:00:00 AM    3   2    3    4
   6    1/13/2018 12:00:00 AM    7   6    2    7

So for 1/10/2018 the 3. Hence, the Data set would look like this:

 Key        Date                     y   x1   x2   x3 6th worst   
       1    1/10/2018 12:00:00 AM    2   3    2    5  3
       1    1/11/2018 12:00:00 AM    3   5    7    2  ... (would have values)
       1    1/12/2018 12:00:00 AM    5   7    4    7  ... (would have values)
       1    1/13/2018 12:00:00 AM    7   2    7    6  ... (would have values)
       2    1/10/2018 12:00:00 AM    2   6    3    8  3
       2    1/11/2018 12:00:00 AM    3   7    7    3  etc.
       2    1/12/2018 12:00:00 AM    3   2    3    4
       2    1/13/2018 12:00:00 AM    7   6    2    7
       3    1/10/2018 12:00:00 AM    2   3    2    5
       3    1/11/2018 12:00:00 AM    3   5    7    2
       3    1/12/2018 12:00:00 AM    5   7    4    7 
       3    1/13/2018 12:00:00 AM    7   2    7    6
       3    1/10/2018 12:00:00 AM    2   6    3    8
       3    1/11/2018 12:00:00 AM    3   7    7    3
       3    1/12/2018 12:00:00 AM    3   2    3    4
       3    1/13/2018 12:00:00 AM    7   6    2    7
       4    1/10/2018 12:00:00 AM    2   3    2    5
       4    1/11/2018 12:00:00 AM    3   5    7    2
       4    1/12/2018 12:00:00 AM    5   7    4    7 
       4    1/13/2018 12:00:00 AM    7   2    7    6
       4    1/10/2018 12:00:00 AM    2   6    3    8
       4    1/11/2018 12:00:00 AM    3   7    7    3
       5    1/12/2018 12:00:00 AM    3   2    3    4
       5    1/13/2018 12:00:00 AM    7   6    2    7
       5    1/10/2018 12:00:00 AM    2   3    2    5
       5    1/11/2018 12:00:00 AM    3   5    7    2
       5    1/12/2018 12:00:00 AM    5   7    4    7 
       5    1/13/2018 12:00:00 AM    7   2    7    6
       6    1/10/2018 12:00:00 AM    2   6    3    8
       6    1/11/2018 12:00:00 AM    3   7    7    3
       6    1/12/2018 12:00:00 AM    3   2    3    4
       6    1/13/2018 12:00:00 AM    7   6    2    7

Here is what I have so far:

#to obtain the 6th worst value from the data set

n=length(df$y)

df$`6th Worst`= df$`6th Worst`= "-"

df[1,3] = round(-sort(subset(df,c(unique(Date), "y")), partial=n-5)[n-5], digits = 2)

I get the following error:

    Error in subset.data.frame(reg_predict, unique(reg_predict2$Date)) : 
  'subset' must be logical

Edit: question differs from the duplicate flagged question in several respects. Particularly in the fact that I need a conditional 6th worst scenario and not just a worst/best scenario.

g3lo
  • 153
  • 1
  • 11
  • Possible duplicate of [Fastest way to find second (third...) highest/lowest value in vector or column](https://stackoverflow.com/questions/2453326/fastest-way-to-find-second-third-highest-lowest-value-in-vector-or-column) – arealhumanbean Jun 26 '18 at 19:17
  • @arealhumanbean dont think this is duplicate. I saw this post. I am trying to implement this code into a more complex application as my OP . . . please advise where that post answers my question and I would be happy to remove this one. – g3lo Jun 26 '18 at 19:22
  • @g3lo Whats role of `Key` column here? – MKR Jun 26 '18 at 19:41

2 Answers2

0

An option using the data.table package:

library(data.table)

## Generate data
set.seed(1)
RowCount <- 100
DT <- data.table(Date = Sys.Date() + sample.int(3,RowCount,TRUE),
                 y = sample.int(100,RowCount,TRUE))

## Sort by y
setkey(DT,y)

## Too much to unpack here in inline commments, will expand further down
SixthWorst_DT <- DT[DT[,.I[6],by = .(Date)]$V1,.(Sixth_Worst = y), keyby = .(Date)]

print(SixthWorst_DT)

#    Date       Sixth_Worst
# 1: 2018-06-27          42
# 2: 2018-06-28          11
# 3: 2018-06-29          22

## Set DT Key to be date for update-join
setkey(DT,Date)
## Temporarily join `SixthWorst_DT` to `DT` (without making a full copy)
## and then create a column in `DT` based on the column `Sixth_Worst` in `SixthWorst_DT`
DT[SixthWorst_DT, Sixth_Worst := i.Sixth_Worst]

## Results
head(DT)

#    Date        y Sixth_Worst
# 1: 2018-06-27 18          42
# 2: 2018-06-27 18          42
# 3: 2018-06-27 19          42
# 4: 2018-06-27 19          42
# 5: 2018-06-27 39          42
# 6: 2018-06-27 42          42

The real meat of the operation is one line:

SixthWorst_DT <- DT[DT[,.I[6],by = .(Date)]$V1,.(Sixth_Worst = y), keyby = .(Date)]

  • DT[,.I[6],by = .(Date)] uses the special symbol .I to extract the 6th row number for each date
  • the $V1 appended extracts the vector of these row numbers
  • DT is then subsetted using this vector
  • DT is then keyed (and implicitly ordered) and grouped by Date to create a summary table with a new column, Sixth_Worst, based on y

To really understand what's going on, I'd recommend running the following statements.

  • DT[,.I[6],by = .(Date)]
  • DT[,.I[6],by = .(Date)]$V1
  • DT[DT[,.I[6],by = .(Date)]$V1]
  • DT[DT[,.I[6],by = .(Date)]$V1,.(Sixth_Worst = y), keyby = .(Date)]
Matt Summersgill
  • 4,054
  • 18
  • 47
0

An option with dplyr and sort can be as:

Note: One can convert the Date column to POSIXct format before grouping but I didn't notice any advantage as such.

library(dplyr)

df %>% group_by(Date) %>% 
  mutate(Worst6th = sort(y)[6])

# A tibble: 32 x 7
# Groups: Date [4]
    Key Date                      y    x1    x2    x3 Worst6th
  <int> <chr>                 <int> <int> <int> <int>    <int>
1     1 1/10/2018 12:00:00 AM     2     3     2     5        2
2     1 1/11/2018 12:00:00 AM     3     5     7     2        3
3     1 1/12/2018 12:00:00 AM     5     7     4     7        5
4     1 1/13/2018 12:00:00 AM     7     2     7     6        7
5     2 1/10/2018 12:00:00 AM     2     6     3     8        2
6     2 1/11/2018 12:00:00 AM     3     7     7     3        3
7     2 1/12/2018 12:00:00 AM     3     2     3     4        5
8     2 1/13/2018 12:00:00 AM     7     6     2     7        7
9     3 1/10/2018 12:00:00 AM     2     3     2     5        2
10     3 1/11/2018 12:00:00 AM     3     5     7     2        3
# ... with 22 more rows      

Data:

df <- read.table(text="
Key     Date                     y   x1   x2   x3
1    '1/10/2018 12:00:00 AM'    2   3    2    5
1    '1/11/2018 12:00:00 AM'    3   5    7    2
1    '1/12/2018 12:00:00 AM'    5   7    4    7 
1    '1/13/2018 12:00:00 AM'    7   2    7    6
2    '1/10/2018 12:00:00 AM'    2   6    3    8
2    '1/11/2018 12:00:00 AM'    3   7    7    3
2    '1/12/2018 12:00:00 AM'    3   2    3    4
2    '1/13/2018 12:00:00 AM'    7   6    2    7
3    '1/10/2018 12:00:00 AM'    2   3    2    5
3    '1/11/2018 12:00:00 AM'    3   5    7    2
3    '1/12/2018 12:00:00 AM'    5   7    4    7 
3    '1/13/2018 12:00:00 AM'    7   2    7    6
3    '1/10/2018 12:00:00 AM'    2   6    3    8
3    '1/11/2018 12:00:00 AM'    3   7    7    3
3    '1/12/2018 12:00:00 AM'    3   2    3    4
3    '1/13/2018 12:00:00 AM'    7   6    2    7
4    '1/10/2018 12:00:00 AM'    2   3    2    5
4    '1/11/2018 12:00:00 AM'    3   5    7    2
4    '1/12/2018 12:00:00 AM'    5   7    4    7 
4    '1/13/2018 12:00:00 AM'    7   2    7    6
4    '1/10/2018 12:00:00 AM'    2   6    3    8
4    '1/11/2018 12:00:00 AM'    3   7    7    3
5    '1/12/2018 12:00:00 AM'    3   2    3    4
5    '1/13/2018 12:00:00 AM'    7   6    2    7
5    '1/10/2018 12:00:00 AM'    2   3    2    5
5    '1/11/2018 12:00:00 AM'    3   5    7    2
5    '1/12/2018 12:00:00 AM'    5   7    4    7 
5    '1/13/2018 12:00:00 AM'    7   2    7    6
6    '1/10/2018 12:00:00 AM'    2   6    3    8
6    '1/11/2018 12:00:00 AM'    3   7    7    3
6    '1/12/2018 12:00:00 AM'    3   2    3    4
6    '1/13/2018 12:00:00 AM'    7   6    2    7",
header = TRUE, stringsAsFactors = FALSE)
MKR
  • 19,739
  • 4
  • 23
  • 33
  • 1
    This worked well: `df %>% group_by(Date) %>% mutate(Worst6th = sort(y)[6])` Thank you! – g3lo Jun 26 '18 at 20:20