1

In my df I have a multiindex like this:

df.index.names
FrozenList([u'Ticker', u'Date'])

<class 'pandas.core.frame.DataFrame'>
MultiIndex: 189667 entries, (AAPL, 1992-08-31 00:00:00) to (^DJI, 2017-08-31 00:00:00)

On a single index df I would do:

from sklearn.model_selection import train_test_split
df_train, df_test = train_test_split(df, test_size=0.2, shuffle=False)

However, this does not work with multiindex, it just cuts the rows in 80/20.

Note: I do not want random sampling, just splitting 80/20 based on date.

Any clues?

Edit:

This is how I fetch the data in question (apart from many more than two tickers):

import pandas as pd
import pandas_datareader.data as web

tickers = ['AAPL', 'AXP']

def get_data(tickers):
    ''' Dowloads daily O/H/L/C data for all symbols'''
    def data(ticker):
        return web.DataReader(ticker, 'yahoo')
    datas = map(data, tickers)
    return pd.concat(datas, keys=tickers, names=['Ticker', 'Date'])

stock_data = get_data(tickers)
cJc
  • 813
  • 1
  • 11
  • 33

1 Answers1

4

Here is one way of doing so:
first get the test group by groupby ticker(index level 0) then sorting each resulting group by date (descending order) then getting the first 20% of the data using selection

df_test = stock_data.groupby(level=0).apply(
    lambda group: group.sort_index(
        ascending=False).iloc[:int(len(group) * .2)]
).reset_index(level=0, drop=True)

df_train will be all the records in stock_data and not in df_test, we can get df_train using boolean masking on the multiple index

df_train = stock_data[~stock_data.index.isin(df_test.index)]

alternatively using the same code for df_test where .2 is replaced by ,8 and ascending=False is replaced by ascending=True

sgDysregulation
  • 4,309
  • 2
  • 23
  • 31