
Summary: I'm trying to use pandas dataframes to store historical stock option chain information. I have the indexes set up in the following order:

  1. quote_datetime: This represents a specific time/candle that the row is from. There will be many rows of data for a specific quote_datetime.
  2. expiration: Options have an expiration date, and there are many expiration dates available at a given point in time.
  3. strike: The strike price for a given option
  4. option_type: either a P or a C for put or call.

With these 4 indexes, you can select any single row of data.

Problem: The problem isn't getting a row in question, it's trying to look up valid combinations of index values without extra information included. For example, what if I want to know which option expiries are available for trade at a specific date (quote_datetime is known, and I want to return all unique 'expirations' matching the quote_datetime key). Or I want to know all strikes available for a given quote_datetime and expiration. In these examples, I don't care about the data, I'm trying to look up which index keys are valid with only a few of the index values known.

Example: I'm dropping the option_type index and lots of data columns for this example to try and keep it small.

oc = { 'quote_datetime': ['2020-08-01', '2020-08-01', '2020-08-01', '2020-08-01', '2020-08-01', '2020-08-01', '2020-08-01', '2020-08-01', '2020-08-01', '2020-08-02', '2020-08-02', '2020-08-02', '2020-08-02', '2020-08-02', '2020-08-02'],
       'expiration': ['2020-08-01', '2020-08-01', '2020-08-01', '2020-08-03', '2020-08-03', '2020-08-03', '2020-08-05', '2020-08-05', '2020-08-05', '2020-08-03', '2020-08-03', '2020-08-03', '2020-08-05', '2020-08-05', '2020-08-05'],
       'strike': [10, 15, 20, 10, 15, 20, 10, 15, 20, 10, 15, 20, 10, 15, 20],
       'price':[3, 2, 1, 4, 3, 2, 5, 4, 3, 3.5, 2.5, 1.5, 4.5, 3.5, 2.5]}


df = pd.DataFrame(data=oc)
df = df.set_index(['quote_datetime','expiration','strike'])
df = df.sort_index()

This gives us a dataframe that looks like:

                                  price
quote_datetime expiration strike       
2020-08-01     2020-08-01 10        3.0
                          15        2.0
                          20        1.0
               2020-08-03 10        4.0
                          15        3.0
                          20        2.0
               2020-08-05 10        5.0
                          15        4.0
                          20        3.0
2020-08-02     2020-08-03 10        3.5
                          15        2.5
                          20        1.5
               2020-08-05 10        4.5
                          15        3.5
                          20        2.5

Let's say I want to see all expiries available on August 2nd.

df.loc['2020-08-02'].index.levels[0]

I expect to receive ['2020-08-03', '2020-08-05'], but instead get

Index(['2020-08-01', '2020-08-03', '2020-08-05'], dtype='object', name='expiration')

'2020-08-01' isn't a valid expiry for the quote_datetime I used in the .loc[]. It seems that .levels simply returns every index value in the dataframe, ignoring that I filtered rows using .loc.
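For what it's worth, this appears to be expected pandas behavior: slicing doesn't prune the `.levels` of the resulting MultiIndex. A minimal sketch (rebuilding the toy frame above) showing `MultiIndex.remove_unused_levels()` as one way to drop the stale labels:

```python
import pandas as pd

# Rebuild the toy frame from the question (same data, condensed)
oc = {'quote_datetime': ['2020-08-01'] * 9 + ['2020-08-02'] * 6,
      'expiration': (['2020-08-01'] * 3 + ['2020-08-03'] * 3 + ['2020-08-05'] * 3
                     + ['2020-08-03'] * 3 + ['2020-08-05'] * 3),
      'strike': [10, 15, 20] * 5,
      'price': [3, 2, 1, 4, 3, 2, 5, 4, 3, 3.5, 2.5, 1.5, 4.5, 3.5, 2.5]}
df = pd.DataFrame(oc).set_index(['quote_datetime', 'expiration', 'strike']).sort_index()

# .loc drops the matched level but leaves .levels untouched;
# remove_unused_levels() rebuilds the levels from the labels actually present
sub = df.loc['2020-08-02']
print(sub.index.levels[0].tolist())                         # ['2020-08-01', '2020-08-03', '2020-08-05']
print(sub.index.remove_unused_levels().levels[0].tolist())  # ['2020-08-03', '2020-08-05']
```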

I've also tried

df.loc['2020-08-02'].index.get_level_values(0)

but it returns every row instead of the unique indexes. This almost works, except that I'd have to run it through a function to get the unique combinations.

Index(['2020-08-03', '2020-08-03', '2020-08-03', '2020-08-05', '2020-08-05',
       '2020-08-05'],
      dtype='object', name='expiration')

This is being done on a 20 GB .csv with a lot of rows, so I'm trying to keep it light and fast if at all possible... Though at this point, it would be nice just to get the data I need. I'm not very well versed in using python for this type of work, but doing so allows me to take advantage of libraries that already exist.

Jon V

2 Answers


You can combine .loc and index.unique() together, passing the required level to get the indices at the other levels. Also, since this is a multi-index, I'd suggest using a tuple with .loc even when indexing a single level, to keep everything consistent.

# To get unique expiration indices for given quote date index

>>> df.loc[('2020-08-02',)].index.unique(0)
Index(['2020-08-03', '2020-08-05'], dtype='object', name='expiration')


# To get unique strike indices for a given quote date index

>>> df.loc[('2020-08-02',)].index.unique(1)
Int64Index([10, 15, 20], dtype='int64', name='strike')

# To know the strike indices for given quote and expiration date indices

>>> df.loc[('2020-08-01', '2020-08-01')].index.unique(0)
Int64Index([10, 15, 20], dtype='int64', name='strike')
ThePyGuy
  • Could be useful to note that you can supply the name of the level instead of an integer based position to `.unique(...)`, so `df.loc[('2020-08-02',)].index.unique(0)` could (more explicitly) be written as `df.loc[('2020-08-02',)].index.unique("expiration")` – Cameron Riddell Jul 28 '21 at 19:43
  • @CameronRiddell, yeah we can do that as well, it just depends on the scenario. But using `named` index is indeed more readable. – ThePyGuy Jul 28 '21 at 19:47
  • This is very interesting. I'll need to give it a try once I get back home. I don't fully understand what a tuple is and why I would use it (I'm primarily an embedded C programmer). Let me do some googling to read up on it. – Jon V Jul 28 '21 at 20:08
  • Functionally, a tuple is kinda like a list but more generic, written as items within parentheses (item1, item2, etc.) instead of square brackets. With regard to datetime and specific to @ThePyGuy's post, pandas datetime slicing requires a tuple. So a single date looks like ('2020-08-02', '2020-08-02'), or all dates (inclusive) after a certain date is ('2020-08-02', ) – CreekGeek Jul 28 '21 at 20:31
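The level-name variant mentioned in the comments can be sketched like this (rebuilding the question's toy frame; same results as the positional calls above):

```python
import pandas as pd

# Toy frame from the question, condensed
oc = {'quote_datetime': ['2020-08-01'] * 9 + ['2020-08-02'] * 6,
      'expiration': (['2020-08-01'] * 3 + ['2020-08-03'] * 3 + ['2020-08-05'] * 3
                     + ['2020-08-03'] * 3 + ['2020-08-05'] * 3),
      'strike': [10, 15, 20] * 5,
      'price': [3, 2, 1, 4, 3, 2, 5, 4, 3, 3.5, 2.5, 1.5, 4.5, 3.5, 2.5]}
df = pd.DataFrame(oc).set_index(['quote_datetime', 'expiration', 'strike']).sort_index()

# unique() accepts the remaining level's name as well as its position
expiries = df.loc[('2020-08-02',)].index.unique('expiration')
print(expiries.tolist())  # ['2020-08-03', '2020-08-05']
strikes = df.loc[('2020-08-02',)].index.unique('strike')
print(strikes.tolist())   # [10, 15, 20]
```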

The main problem (based on the code you provided) is that your date fields are not datetime types.

So, first confirm that 'quote_datetime' and 'expiration' are datetime types. Assuming they are not, you can alter the existing fields in-place:

df.reset_index(inplace=True)

# make cols datetime
df.quote_datetime = pd.to_datetime(df.quote_datetime)
df.expiration = pd.to_datetime(df.expiration)

df = df.set_index(['quote_datetime','expiration','strike'])

Or, I prefer handling this in the initial CSV import statement, which should look something like:

df = pd.read_csv("file.csv", header=0, parse_dates=['quote_datetime', 'expiration'], keep_date_col=False)
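A quick way to sanity-check that the parse worked (shown here against an in-memory stand-in for the real file, since the actual CSV isn't available):

```python
import io
import pandas as pd

# Hypothetical CSV fragment with the question's column layout
csv = io.StringIO(
    "quote_datetime,expiration,strike,price\n"
    "2020-08-02,2020-08-03,10,3.5\n"
    "2020-08-02,2020-08-05,10,4.5\n"
)
df = pd.read_csv(csv, header=0, parse_dates=['quote_datetime', 'expiration'])
df = df.set_index(['quote_datetime', 'expiration', 'strike']).sort_index()

# Both date levels should now be datetime64[ns], not object
print(df.index.get_level_values('quote_datetime').dtype)  # datetime64[ns]
print(df.index.get_level_values('expiration').dtype)      # datetime64[ns]
```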

At this point, you can slice based on the dates (as attempted in your post). To return a subsetted dataframe you can work with:

# see https://stackoverflow.com/a/45270356/9249533
df.loc[pd.IndexSlice['2020-08-02':'2020-08-02',:], :]

or, to get the list of unique dates you wanted:

lst = df.loc[pd.IndexSlice['2020-08-02':'2020-08-02', :], :].index.unique(level=1).tolist()

which returns a list of timestamps you can reformat as needed:

[Timestamp('2020-08-03 00:00:00'), Timestamp('2020-08-05 00:00:00')]

[i.strftime('%Y-%m-%d') for i in lst]

to get

['2020-08-03', '2020-08-05']

You're right to attempt this via the index. Running the lookup on regular columns can be done, but it's verbose and I suspect it would suffer performance-wise given your file size. For comparison (assuming you've run my first block without its last line)...

# add logical test to subset data
d_test = '2020-08-02'
df['expiration'].loc[df['quote_datetime'] >= d_test].unique()

returns

array(['2020-08-03T00:00:00.000000000', '2020-08-05T00:00:00.000000000'], dtype='datetime64[ns]')

So, adapting the logic from this post (note this needs `from datetime import datetime as dt`, and reuses `d_test` from above):

[dt.fromtimestamp(t.item() / 10**9).strftime('%Y-%m-%d') for t in df['expiration'].loc[df['quote_datetime'] >= d_test].unique()]

['2020-08-03', '2020-08-05']

CreekGeek
  • Good catch, my example code doesn't do this. My actual dataset does interpret them as datetime types. Let me try your suggested ways to get the index on the bigger dataset and confirm that it was just the method I was using to get the subsets. Will report back when I get home. – Jon V Jul 28 '21 at 20:01