This likely isn't an easy question. I am looking at public trade data, and trying to group 'related' items. But the data itself isn't perfect (the human element) or is misleading. The good news, is that related items generally follow similar patterns.
Here's an example
Notional | Premium | Strike | Structure | StartDate | EndDate | Index | ExecutionTimestamp | ResetFreq | |
---|---|---|---|---|---|---|---|---|---|
10 | 500,000,000 | 9,125,000 | 0.02925 | Call | 2024-06-30 | 2025-06-30 | SOFR-OIS | 2022-06-28T13:41:01 | 3 |
11 | 500,000,000 | 9,125,000 | 0.02925 | Call | 2024-06-30 | 2025-06-30 | SOFR-OIS | 2022-06-28T13:41:01 | 3 |
12 | 250,000,000 | 3,837,500 | 0.03255 | Call | 2023-06-30 | 2024-06-30 | SOFR-OIS | 2022-06-28T14:36:15 | 3 |
13 | 250,000,000 | 3,837,500 | 0.03255 | Call | 2023-06-30 | 2024-06-30 | SOFR-OIS | 2022-06-28T14:37:11 | 3 |
14 | 380,000,000 | 1,633,999.99462 | 0.02473 | Put | 2023-06-30 | 2024-06-30 | SOFR-OIS | 2022-06-28T16:40:37 | 3 |
15 | 380,000,000 | 1,633,999.99462 | 0.02473 | Put | 2023-06-30 | 2024-06-30 | SOFR-OIS | 2022-06-28T16:40:37 | 3 |
16 | 130,000,000 | 987,999.99952 | 0.03223 | Call | 2023-06-30 | 2024-06-30 | SOFR-OIS | 2022-06-28T16:41:00 | 3 |
17 | 130,000,000 | 987,999.99952 | 0.03223 | Call | 2023-06-30 | 2024-06-30 | SOFR-OIS | 2022-06-28T16:41:00 | 3 |
18 | 130,000,000 | 987,999.99952 | 0.03223 | Put | 2023-06-30 | 2024-06-30 | SOFR-OIS | 2022-06-28T16:41:16 | 3 |
19 | 130,000,000 | 987,999.99952 | 0.03223 | Put | 2023-06-30 | 2024-06-30 | SOFR-OIS | 2022-06-28T16:41:16 | 3 |
A lot of real similar looking things here! One specific note on the columns:
Structure is misleading. In reality, there are 3 possible values [Call, Put, Straddle]. However, it seems like Put = Put, but Call = Call OR Straddle in the data. A straddle = 1 Put + 1 Call with the same notional, dates and strike.
So in the example I can see that index 16, 17, 18, 19 are very likely to be 2 straddle trades [and I know this to be true]. However, index 14 & 15 are also related to these two straddle trades! The way I would know this is:
- Timestamp is very close in time (<5min-ish)
- The difference in (strike*10000) between 16/17/18/19 and 14/15 is divisible by a factor of 25 [people like round numbers].
- StartDate, EndDate, Index, Reset Freq match as well.
In this case (14,15,16,17,18,19) I would want to return 2x “1 straddle vs 1 put trade with details xyz”. In pandas terms, I would convert the put+call into a straddle where applicable, and be left with two “Group” IDs to identify related trade.
So in general, I guess I'm asking if there is a way to create a list of rules that determine the probability of it being a related trade? I'm open to any suggestions though!
I can edit the question for more specific rules etc if needed.
Tx!