I have a financial time series in R (currently an xts object, but I'm also looking into tibble right now).

How do I find the probability of 2 adjacent rows matching a condition?

For example, I want to know the probability of 2 consecutive days having a value higher than the mean/median. I know I can lag the previous day's value into the next row, which would allow me to get this statistic, but that seems very cumbersome and inflexible.

Is there a better way to get this done?

xts sample data:

foo <- xts(x = c(1,1,5,1,5,5,1), seq(as.Date("2016-01-01"), length = 7, by = "days"))

What's the probability of 2 consecutive days having a higher than median value?

Konrad Rudolph
TommyF
  • Please provide a minimal [Reproducible example](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example-aka-mcve-minimal-complete-and-ver). – Heikki Nov 23 '17 at 07:49
  • I added minimal xts sample data. – TommyF Nov 23 '17 at 07:59

2 Answers

You can create a new column that flags which values are higher than the median, and then keep only the rows where both that row and the previous one are flagged.

> library(tibble)
> foo <- tibble(x = c(1,1,5,1,5,5,1), date = seq(as.Date("2016-01-01"), length.out = 7, by = "days"))

Step 1

Create column to find those that are higher than median

> foo$higher_than_median <- foo$x > median(foo$x)

Step 2

Compare adjacent rows of that column using diff.

A difference of 0 means a row has the same flag as the previous row (both higher, or both not higher): c(0, diff(foo$higher_than_median)) == 0

Then add the condition that the row itself must be higher, so that both rows are higher: foo$higher_than_median == TRUE

Full Expression:

foo$both_higher <- c(0, diff(foo$higher_than_median)) == 0 & foo$higher_than_median == TRUE

Step 3

To find the probability, take the mean of foo$both_higher:

mean(foo$both_higher)
[1] 0.1428571
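
Note that the mean above divides by the number of rows (7), while the series only contains 6 pairs of adjacent days. As a minimal sketch of a more flexible variant, assuming you want the share of adjacent windows in which every day satisfies the condition, the helper below (prob_consecutive is a hypothetical name, not part of any package) uses base R's embed() to build all windows of k adjacent values:

```r
# Sample values from the question
x <- c(1, 1, 5, 1, 5, 5, 1)

# Flag the days that are strictly above the whole-sample median
above <- x > median(x)

# Hypothetical helper: share of windows of k adjacent rows
# in which every row satisfies the condition
prob_consecutive <- function(flag, k = 2) {
  stopifnot(is.logical(flag), k >= 1, length(flag) >= k)
  # embed() returns one row per window of k adjacent values
  windows <- embed(as.integer(flag), k)
  # A window qualifies when all k of its flags are TRUE
  mean(rowSums(windows) == k)
}

p2 <- prob_consecutive(above, k = 2)
p2
```

With this definition the denominator is the number of adjacent pairs rather than the number of rows, so the result differs from the mean computed above. Changing k gives the same statistic for runs of 3 or more days.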
Matt W.

Here is a pure xts solution.

How do you define the median? There are several ways.

In an online time series application, such as computing a moving average, you can compute the median over a fixed lookback window (shown below) or from the origin up to the current time step (an anchored window calculation). Either way, the median computation must not use values beyond the current time step, to avoid look-ahead bias:

library(xts)
library(TTR)

x <- rep(c(1,1,5,1,5,5,1, 5, 5, 5), 10)
y <- xts(x = x, seq(as.Date("2016-01-01"), length = length(x), by = "days"), dimnames = list(NULL, "x"))

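# Avoid look-ahead bias in an online time series application by ranking each value within a rolling fixed time window.
# runPercentRank() returns the percent rank of the current value in that window; a rank above 0.5 is treated as "above the rolling median".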
nMedLookback <- 5
y$med <- runPercentRank(y[, "x"], n = nMedLookback)
y$isAboveMed <- y$med > 0.5

nSum <- 2
y$runSum2 <- runSum(y$isAboveMed, n = nSum)

z <- na.omit(y)
prob <- sum(z[,"runSum2"] >= nSum) / NROW(z)

The case where the median is computed over the entire data set is a straightforward simplification of this.
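
As a rough sketch of that whole-sample case, assuming the 7-day sample series from the question and a column named x (the column names here are illustrative choices), the rolling rank can be replaced by a single median over all observations. This uses future values in the median, so it only suits descriptive statistics, not an online application:

```r
library(xts)
library(TTR)

# Sample data from the question, with an assumed column name "x"
foo <- xts(c(1, 1, 5, 1, 5, 5, 1),
           order.by = seq(as.Date("2016-01-01"), length.out = 7, by = "days"))
colnames(foo) <- "x"

# Whole-sample median (look-ahead: every row sees all observations)
fullMedian <- median(as.numeric(coredata(foo$x)))
foo$isAboveMed <- foo$x > fullMedian

# Count how many of the last nSum days were above the median
nSum <- 2
foo$runSum2 <- runSum(foo$isAboveMed, n = nSum)

# Drop the leading NA row, then take the share of days that complete a run of nSum
z <- na.omit(foo)
prob <- sum(z[, "runSum2"] >= nSum) / NROW(z)
prob
```

The rest of the pipeline (runSum, na.omit, and the final ratio) is unchanged from the rolling version above.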

FXQuantTrader
  • Would you suggest a non-xts solution is better suited for financial time series? Judging from your username you have some experience with this ;-) – TommyF Nov 23 '17 at 09:36
  • Have you looked up what xts stands for? ;). As a general rule, if working with an xts object, I would always use xts utilities, which are typically fast and based on C implementations. This matters more for really large objects, such as tick data with 1e8+ rows. At the end of the day though, do whatever you're most comfortable with, for small data sets at least. – FXQuantTrader Nov 23 '17 at 17:03