0

I have the following MultiIndex created with df.groupby(...).resample(). It is stock market-like OHLC data grouped by an asset and then having OHLC candle time-series for this asset.

                       high     low   close  ...  avg_trade  buys  sells
pair     timestamp                           ...                        
AAVE-ETH 2020-01-01    80.0    80.0    80.0  ...     1280.0     1      0
         2020-01-02    96.0    96.0    96.0  ...     1120.0     1      0
ETH-USDC 2020-01-02  1600.0  1600.0  1600.0  ...     5000.0     1      0
         2020-01-05  1620.0  1400.0  1400.0  ...     1125.0     1      1

The df.index content is:

MultiIndex([('AAVE-ETH', '2020-01-01'),
            ('AAVE-ETH', '2020-01-02'),
            ('ETH-USDC', '2020-01-02'),
            ('ETH-USDC', '2020-01-05')],
           names=['pair', 'timestamp'])

I would like to do a DataFrame.truncate() like operation by the second index (timestamp) so that I discard all entries beyond a certain timestamp.

However the naive df.truncate(timestamp) will give an error:

TypeError: Level type mismatch: 2020-01-04 00:00:00

Can a grouped data frame be truncated by its second index (time series) somehow?

Mikko Ohtamaa
  • 82,057
  • 50
  • 264
  • 435
  • can you provide example code and desired output? – Panda Kim Dec 14 '22 at 00:07
  • Note for moderators: this question is about truncating `DataFrame` - not about selecting or filtering the rows. – Mikko Ohtamaa Dec 14 '22 at 00:37
  • [`pandas.DataFrame.truncate`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.truncate.html): _This is a useful shorthand for boolean indexing based on index values above or below certain thresholds._ And Boolean indexing is selecting and filtering. As such, this is in fact a duplicate of https://stackoverflow.com/q/53927460/7758804, https://stackoverflow.com/q/66681726/7758804, https://stackoverflow.com/q/72109415/7758804 – Trenton McKinney Dec 14 '22 at 00:42
  • @Mikko Ohtamaa Regardless of indexing or slicing or filtering or `xs` or `take` or some way using truncate, answerer runs code. So need code. For questions where the problem can be easily identified by eye, looking at `pandas document` will suffice. And for str, text is sufficient, but for datetime, code is required for run exactly. – Panda Kim Dec 14 '22 at 00:47

1 Answers1

1

You can use get_level_values for "timestamp" and filter by required date literal:

# Create dataframe
df = pd.DataFrame(data=[["AAVE-ETH","2020-01-01",80.0,80.0,80.0,80.0],["AAVE-ETH","2020-01-02",96.0,96.0,96.0,96.0],["AAVE-ETH","2020-01-03",95.0,95.0,95.0,95.0],["ETH-USDC","2020-01-01",96.0,96.0,96.0,96.0],["ETH-USDC","2020-01-02",97.0,97.0,97.0,97.0],["ETH-USDC","2020-01-03",98.0,98.0,98.0,98.0]], columns=["pair","timestamp","high","low","close","open"])
df["timestamp"] = df["timestamp"].apply(pd.to_datetime)

# Apply index
index = pd.MultiIndex.from_frame(df[["pair", "timestamp"]])
df = df.set_index(index).drop(["pair", "timestamp"], axis=1)

                     high   low  close  open
pair     timestamp                          
AAVE-ETH 2020-01-01  80.0  80.0   80.0  80.0
         2020-01-02  96.0  96.0   96.0  96.0
         2020-01-03  95.0  95.0   95.0  95.0
ETH-USDC 2020-01-01  96.0  96.0   96.0  96.0
         2020-01-02  97.0  97.0   97.0  97.0
         2020-01-03  98.0  98.0   98.0  98.0

Filter by date literal (fox ex. keep entries after 2020-01-01):

# Filter by date
df = df[df.index.get_level_values("timestamp") > pd.Timestamp(2020, 1, 1)]

                     high   low  close  open
pair     timestamp                          
AAVE-ETH 2020-01-02  96.0  96.0   96.0  96.0
         2020-01-03  95.0  95.0   95.0  95.0
ETH-USDC 2020-01-02  97.0  97.0   97.0  97.0
         2020-01-03  98.0  98.0   98.0  98.0
Azhar Khan
  • 3,829
  • 11
  • 26
  • 32