Pull subset of rows of dataframe based on conditions from other columns

Question

I have a dataframe like the one below:

x <- data.table(Tickers=c("A","A","A","B","B","B","B","D","D","D","D"),
                Type=c("put","call","put","call","call","put","call","put","call","put","call"),
                Strike=c(35,37.5,37.5,10,11,11,12,40,40,42,42),
                Other=sample(20,11))

    Tickers Type Strike Other
 1:       A  put   35.0     6
 2:       A call   37.5     5
 3:       A  put   37.5    13
 4:       B call   10.0    15
 5:       B call   11.0    12
 6:       B  put   11.0     4
 7:       B call   12.0    20
 8:       D  put   40.0     7
 9:       D call   40.0    11
10:       D  put   42.0    10
11:       D call   42.0     1

I am trying to analyze a subset of the data. The subset I would like to take is data where the ticker and strike are the same. But I also only want to grab this data if both a put and a call exists under type. With the data above for example, I would like to return the following result:

x[c(2,3,5,6,8:11),]

   Tickers Type Strike Other
1:       A call   37.5     5
2:       A  put   37.5    13
3:       B call   11.0    12
4:       B  put   11.0     4
5:       D  put   40.0     7
6:       D call   40.0    11
7:       D  put   42.0    10
8:       D call   42.0     1

I'm not sure what the best way to go about doing this. My thought process is that I should create another column vector like

x$id <- paste(x$Tickers,x$Strike,sep="_")

Then use this vector to only pull values where there are multiple ids.

x[x$id %in% x$id[duplicated(x$id)],]

   Tickers Type Strike Other     id
1:       A call   37.5     5 A_37.5
2:       A  put   37.5    13 A_37.5
3:       B call   11.0    12   B_11
4:       B  put   11.0     4   B_11
5:       D  put   40.0     7   D_40
6:       D call   40.0    11   D_40
7:       D  put   42.0    10   D_42
8:       D call   42.0     1   D_42

I'm not sure how efficient this is, as my actual data consists of a lot more rows. Also, this solution does not check for the type condition of there being one put and one call.

also the wording of the title could be a lot better, I apologize

EDIT::: having checked out this post Finding ALL duplicate rows, including "elements with smaller subscripts"

I could also use this solution:

x$id <- paste(x$Tickers,x$Strike,sep="_")
x[duplicated(x$id) | duplicated(x$id,fromLast=T),]

Mike H. · Accepted Answer · 2018-05-11T14:39:54.830

3

You could try something like:

x[, select := (.N >= 2 & all(c("put", "call") %in% unique(Type))), by = .(Tickers, Strike)][which(select)]

#   Tickers Type Strike Other select
#1:       A call   37.5    17   TRUE
#2:       A  put   37.5    16   TRUE
#3:       B call   11.0    11   TRUE
#4:       B  put   11.0    20   TRUE
#5:       D  put   40.0     1   TRUE
#6:       D call   40.0    12   TRUE
#7:       D  put   42.0     6   TRUE
#8:       D call   42.0     2   TRUE

Another idea might be a merge:

x[x, on = .(Tickers, Strike), select := (length(Type) >= 2 & all(c("put", "call") %in% Type)),by = .EACHI][which(select)]

I'm not entirely sure how to get around the group-by operations since you want to make sure for each group they have both "call" and "put". I was thinking about using keys, but haven't been able to incorporate the "call"/"put" aspect.

edited May 11 '18 at 14:39

answered May 11 '18 at 14:27

Mike H.

13,960
2
29
39

I like this - the `all(c("put", "call") %in% unique(Type))` part is more robust than my solution. – Brian Stamper May 11 '18 at 14:30
Thanks @BrianStamper, it was inspired in part by your solution so thank you! – Mike H. May 11 '18 at 14:30
And the `which(select)` construct is one I am going to have to remember! – Brian Stamper May 11 '18 at 14:39
Very nice solution! I really appreciate the `all(c("put", "call") %in% unique(Type))` part as well. This allows for more complex subsetting if my `type` variable were to include other combinations of options. If I were to want to include another column as part of the group, I could also simply add it `by = .(Tickers, Strike)` correct? – road_to_quantdom May 11 '18 at 14:39

Brian Stamper · Answer 2 · 2018-05-11T14:26:42.927

An edit to your data to give a case where both put and call does not exist (I changed the very last "call" to "put"):

x <- data.table(Tickers=c("A","A","A","B","B","B","B","D","D","D","D"),
            Type=c("put","call","put","call","call","put","call","put","call","put","put"),
            Strike=c(35,37.5,37.5,10,11,11,12,40,40,42,42),
            Other=sample(20,11))

Since you are using data.table, you can use the built in counter .N along with by variables to count groups and subset with that. If by counting Type you can reliably determine there is both put and call, this could work:

x[, `:=`(n = .N, types = uniqueN(Type)), by = c('Tickers', 'Strike')][n > 1 & types == 2]

The part enclosed in the first set of [] does the counting, and then the [n > 1 & types == 2] does the subsetting.

i like this solution! I will create sample data that is larger and more in line with the size of my full data and compare results soon! — road_to_quantdom, May 11 '18 at 14:23

score 0 · Answer 3 · answered May 11 '18 at 14:29

I am not a user of package data.table so this code is base R only.

agg <- aggregate(Type ~ Tickers + Strike, data = x, length)
result <- merge(x, subset(agg, Type > 1)[1:2], by = c("Tickers", "Strike"))[, c(1, 3, 2, 4)]
result
#   Tickers Type Strike Other
#1:       A call   37.5    17
#2:       A  put   37.5     7
#3:       B call   11.0    14
#4:       B  put   11.0    20
#5:       D  put   40.0    15
#6:       D call   40.0     2
#7:       D  put   42.0     8
#8:       D call   42.0     1


rm(agg)    # final clean up

Pull subset of rows of dataframe based on conditions from other columns

3 Answers3