Summary: I'm trying to use pandas dataframes to store historical stock option chain information. I have the indexes set up in the following order:
- quote_datetime: This represents a specific time/candle that the row is from. There will be many rows of data for a specific quote_datetime.
- expiration: Options have an expiration date, and there are many expiration dates available at a given point in time.
- strike: The strike price for a given option
- option_type: either a P or a C for put or call.
With these 4 indexes, you can select any single row of data.
Problem: The problem isn't getting a row in question, it's trying to look up valid combinations of index values without extra information included. For example, what if I want to know which option expiries are available for trade at a specific date (quote_datetime is known, and I want to return all unique 'expirations' matching the quote_datetime key). Or I want to know all strikes available for a given quote_datetime and expiration. In these examples, I don't care about the data, I'm trying to look up which index keys are valid with only a few of the index values known.
Example: I'm dropping the option_type index and lots of data columns for this example to try and keep it small.
oc = { 'quote_datetime': ['2020-08-01', '2020-08-01', '2020-08-01', '2020-08-01', '2020-08-01', '2020-08-01', '2020-08-01', '2020-08-01', '2020-08-01', '2020-08-02', '2020-08-02', '2020-08-02', '2020-08-02', '2020-08-02', '2020-08-02'],
'expiration': ['2020-08-01', '2020-08-01', '2020-08-01', '2020-08-03', '2020-08-03', '2020-08-03', '2020-08-05', '2020-08-05', '2020-08-05', '2020-08-03', '2020-08-03', '2020-08-03', '2020-08-05', '2020-08-05', '2020-08-05'],
'strike': [10, 15, 20, 10, 15, 20, 10, 15, 20, 10, 15, 20, 10, 15, 20],
'price':[3, 2, 1, 4, 3, 2, 5, 4, 3, 3.5, 2.5, 1.5, 4.5, 3.5, 2.5]}
df = pd.DataFrame(data=oc)
df = df.set_index(['quote_datetime','expiration','strike'])
df = df.sort_index()
This gives us a dataframe that looks like:
price
quote_datetime expiration strike
2020-08-01 2020-08-01 10 3.0
15 2.0
20 1.0
2020-08-03 10 4.0
15 3.0
20 2.0
2020-08-05 10 5.0
15 4.0
20 3.0
2020-08-02 2020-08-03 10 3.5
15 2.5
20 1.5
2020-08-05 10 4.5
15 3.5
20 2.5
Let's say I want to see all expiries available on August 2nd.
df.loc['2020-08-02'].index.levels[0]
I expect to receive ['2020-08-03', '2020-08-05'], but instead get
Index(['2020-08-01', '2020-08-03', '2020-08-05'], dtype='object', name='expiration')
'2020-08-01' isn't a valid expiry for the quote_datetime I used in the .loc[]. It seems that the .levels simply returns every index in the dataframe, ignoring that I filtered rows using .loc.
I've also tried
df.loc['2020-08-02'].index.get_level_values(0)
but it returns every row instead of the unique indexes. This almost works, except that I'd have to run it through a function to get the unique combinations.
Index(['2020-08-03', '2020-08-03', '2020-08-03', '2020-08-05', '2020-08-05',
'2020-08-05'],
dtype='object', name='expiration')
This is being done on a 20 GB .csv with a lot of rows, so I'm trying to keep it light and fast if at all possible... Though at this point, it would be nice just to get the data I need. I'm not very well versed in using python for this type of work, but doing so allows me to take advantage of libraries that already exist.