1

I have transaction data that I hope to resample in a fashion similar to OHLC stock market prices.

  • The goal is to display a meaningful price summary for each day

However, my challenge is that the transactional data is sparse.

  • There are days without transactions

  • The opening price of the day might happen in the middle of the day and does not roll over from the previous day automatically

I can do the naive OHLC with resample(), as seen in the example below. Because the data is sparse, the normal straightforward resample gives unideal results. To get a meaningful price sample, the following conditions should be met too:

  • The opening price is always set to the closing price of the previous day

  • If any day does not have transactions, all OHLC values are the closing price of the previous day ("price does not move")

I can do this in pure Python, as it is not that difficult, but it is not very computationally efficient for high volumes of data. Thus, my question is, would Pandas offer any clever way of doing resample or aggregate satisfying the conditions above, but without needing to loop values in Python manually?

The example code is below:

import pandas as pd

# Transactions do not have regular intervals and may miss days
data = {
    "timestamp": [
        pd.Timestamp("2020-01-01 01:00"),
        pd.Timestamp("2020-01-01 05:00"),
        pd.Timestamp("2020-01-02 03:00"),
        pd.Timestamp("2020-01-04 04:00"),
        pd.Timestamp("2020-01-05 00:00"),
    ],
    "transaction": [
        100.00,
        102.00,
        103.00,
        102.80,
        99.88
    ]
}

df = pd.DataFrame.from_dict(data, orient="columns")
df.set_index("timestamp", inplace=True)

print(df)
                      transaction
timestamp                       
2020-01-01 01:00:00       100.00
2020-01-01 05:00:00       102.00
2020-01-02 03:00:00       103.00
2020-01-04 04:00:00       102.80
2020-01-05 00:00:00        99.88
# https://stackoverflow.com/a/36223274/315168
naive_resample = df["transaction"].resample("1D") .agg({'open': 'first', 'high': 'max', 'low': 'min', 'close': 'last'})
print(naive_resample)

In this result, you can see that:

  • open/close do not match over daily boundaries

  • if a day does not have transactions price is marked as NaN

           open    high     low   close
timestamp                                 
2020-01-01  100.00  102.00  100.00  102.00
2020-01-02  103.00  103.00  103.00  103.00
2020-01-03     NaN     NaN     NaN     NaN
2020-01-04  102.80  102.80  102.80  102.80
2020-01-05   99.88   99.88   99.88   99.88
Mikko Ohtamaa
  • 82,057
  • 50
  • 264
  • 435

1 Answers1

1

You can use following logic:

  • Shift "close" to next row as "prev_close", to use it for next row processing.
  • If "open" == NaN, fill in "prev_close" for OHLC.
  • Match "open" with "prev_close" for all.
# Fill missing days "close" with last known value.
naive_resample["close"] = naive_resample["close"].fillna(method="ffill")

# Shift "close" to next row, to use it for next row processing.
naive_resample["prev_close"] = naive_resample["close"].shift(1)

# First transaction has no "prev_close". Adjust it to prevent NaN spill over.
naive_resample.iloc[0, naive_resample.columns.get_loc("prev_close")] = naive_resample.iloc[0, naive_resample.columns.get_loc("open")]

def adjust_ohlc(row):
  # Process missing day
  if math.isnan(row["open"]):
    return pd.Series([row["prev_close"]] * 4)
  else:
    # Adjust "open" with "prev_close"
    return pd.Series([row["prev_close"], row["high"], row["low"], row["close"]])

naive_resample[["open", "high", "low", "close"]] = naive_resample.apply(adjust_ohlc, axis=1)

naive_resample = naive_resample.drop("prev_close", axis=1)

Output:

             open    high     low   close
timestamp                                
2020-01-01  100.0  102.00  100.00  102.00
2020-01-02  102.0  102.00  102.00  102.00
2020-01-03  102.0  103.00  103.00  103.00
2020-01-04  103.0  102.80  102.80  102.80
2020-01-05  102.8   99.88   99.88   99.88
Azhar Khan
  • 3,829
  • 11
  • 26
  • 32