
I have about 50 datasets, each containing all trades within a 30-day window for about 10 pairs on 5 exchanges. All pairs belong to the same asset class, so they are strongly correlated and are expected to have similar properties, but they are on different scales. An example of this data would be:

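set.seed(1)

n <- 1000  # number of observations
# one-minute timestamps covering a single day
dates <- seq(as.POSIXct("2019-08-05 00:00:00", tz = "UTC"), as.POSIXct("2019-08-05 23:59:00", tz = "UTC"), by = "1 min")
# random walk observed at n distinct, sorted timestamps
x <- data.frame(t = sort(sample(dates, n)), p = cumsum(sample(c(-1, 1), n, TRUE)))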

Plot example

Roughly, I need to identify the relevant local minima and maxima, which happen daily. The yellow marks are my points of interest. Unlike this example, there is usually only one such point per day and I consider each day separately. However, it is hard to filter out noise from my actual points of interest.

My actual goal is to find the exact point at which the pair started to make a jump and the exact point at which the jump is over. This needs to be as accurate as possible, as I want to observe which asset moved first and which asset followed at which point in time (as said, they are highly correlated). Between two extreme values, I want to minimize the distance and maximize the relative/absolute change, as my points of interest are usually close to each other and their difference is quite large.

I already looked at other questions like Finding local maxima and minima and Algorithm to locate local maxima, and also at this algorithm that has the same goal. However, my dataset is extremely noisy. I already reduced the dataset to 5-minute intervals, but this caused the functions that identify local minima and maxima to miss the relevant points. Therefore, this was not a good solution given my goal.

How can I achieve my goal with a reasonably accurate algorithm? Manually skimming through all the time series is not an option, since this would require me to evaluate 50 * 30 time series by hand, which is too time-consuming. I'm really puzzled and have been trying to find a suitable solution for a week.

If more code snippets are needed, I'm happy to share them. However, they didn't give me meaningful results, which would defeat the purpose of a minimal working example, so I decided to leave them out for now.

EDIT: First off, I updated the plot and added timestamps to the dataset to give you an idea of the actual resolution. Ideally, the algorithm would detect both jumps on the left: the inner two dots because they're closer together and the jump between them is uninterrupted, and the outer dots because their values are more extreme. This also answers the question of whether the algorithm is allowed to look into the future: yes. If there's another local extremum within, say, 30 observations (or 30 minutes), then ignore the intermediate local extremum. In my data, jumps have ranged from 2% to about 15%, so a jump needs to be at least 2% to be considered, and only if there are at least 15 (this threshold might be adaptable) consecutive steps in the same direction before/after the peaks and valleys.
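Read literally, the rules above could be sketched in base R as follows. This is only one possible interpretation: the helper name `find_runs` and its parameters `min_steps` and `min_move` are made up for illustration, and requiring strictly consecutive steps in one direction is probably too strict for very noisy data.

```r
# Hypothetical helper: find monotone runs in a price vector p that last
# at least min_steps steps and move at least min_move in relative terms.
find_runs <- function(p, min_steps = 15, min_move = 0.02) {
    d <- sign(diff(p))                # direction of each step: -1, 0 or 1
    r <- rle(d)                       # runs of identical direction
    end <- cumsum(r$lengths) + 1      # index in p where each run ends
    start <- end - r$lengths          # index in p where each run starts
    move <- p[end] / p[start] - 1     # relative change over the run
    keep <- r$values != 0 & r$lengths >= min_steps & abs(move) >= min_move
    data.frame(start = start[keep], end = end[keep],
               steps = r$lengths[keep], move = move[keep])
}

# Toy series: three up steps (+3%), then small wiggles
p_toy <- c(100, 101, 102, 103, 102, 103)
find_runs(p_toy, min_steps = 3, min_move = 0.02)
```

The returned `start` and `end` are positions in the price vector, so they can be mapped back to timestamps with the corresponding time column.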

A very naive approach was to subset the data around the global minimum and maximum of a day. In most cases, this denoised the data and worked as an indicator. However, it is not robust when the global extrema do not fall within the range of the jump.

Hope this clarifies why this isn't a statistical question (there are some tests to determine whether a jump has happened, but not for jump arrival time afaik).


In case anyone wants a real example: this is a corresponding graph, this is the raw data of the relevant period and this is the reduced dataset.


zonfl
  • have a look at https://facebook.github.io/prophet/ and this tweet thread is very helpful https://twitter.com/seanjtaylor/status/1123278380369973248 – infominer May 07 '19 at 18:00
  • Please check ["Which site?"](https://meta.stackexchange.com/questions/129598/which-computer-science-programming-stack-exchange-do-i-post-in) for general issues. This is a higher-level problem than we handle on this site; I suggest Stack Exchange Statistics. – Prune May 07 '19 at 18:09
  • Maybe my problem description is flawed (in which case I apologize and will be sure to revise the description), but this is mainly an algorithmic problem. I don't need a tool like prophet to make forecasts for me, and there is unfortunately no statistical solution available for this specific problem. Can you let me know in what way this problem is too high-level and I will accordingly clarify. – zonfl May 07 '19 at 18:41
  • Your description is not 'flawed', but incomplete. For instance, your second dot from the left marks a high, but slightly to the right is a higher high. Why do you not choose that one? You need to define rules for a local extremum: Is the algorithm allowed to look into the future? What happens if two local extrema are close together (as happens in the middle of the chart)? By how much does a local extremum need to differ from the surrounding points? Over what range should an extreme point be computed? And so on... – Enrico Schumann May 08 '19 at 07:16
  • Valid points! I will make an edit – zonfl May 08 '19 at 08:45

1 Answer


Perhaps as a starting point, look at the function streaks in the package PMwR (which I maintain). A streak is defined as a move of a specified size that is uninterrupted by a countermove of the same size. The function works with returns, not differences, so I add 100 to your data to keep the series positive.

For instance:

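library("PMwR")  # provides streaks()

set.seed(1)
n <- 1000
# random walk shifted by 100, so that returns can be computed
x <- 100 + cumsum(sample(c(-1, 1), n, TRUE))

plot(x, type = "l")
# a streak is a move of at least 12% without a countermove of 12%
s <- streaks(x, up = 0.12, down = -0.12)
abline(v = s[, 1])  # starts of streaks
abline(v = s[, 2])  # ends of streaks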

The vertical lines show the starts and ends of streaks.

Streaks

Perhaps you can then filter the identified streaks by the required criteria, such as length. Or you may play around with different thresholds for up and down moves (this is not really recommended in the current implementation, but perhaps the results are good enough). For instance, up streaks might look as follows. A green vertical line shows the start of a streak; a red line shows its end.

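plot(x, type = "l")
# asymmetric thresholds: 12% for up moves, 5% for down moves
s <- streaks(x, up = 0.12, down = -0.05)
# keep only the up streaks
s <- s[!is.na(s$state) & s$state == "up", ]
abline(v = s[, 1], col = "green")  # starts
abline(v = s[, 2], col = "red")    # ends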

Up streaks
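Filtering the streaks by length and size, as suggested above, might look roughly like the following sketch. The helper `filter_streaks` and its parameters `max_len` and `min_move` are hypothetical and not part of PMwR. It assumes that the start and end positions are in the first two columns and that there is a `state` column, as in the examples above; the toy input is built by hand so that the sketch runs without PMwR.

```r
# Hypothetical helper, not part of PMwR: keep only streaks that are short
# in duration and large in relative size.
filter_streaks <- function(s, x, max_len = 30, min_move = 0.02) {
    start <- s[, 1]
    end <- s[, 2]
    len <- end - start                # number of steps in the streak
    move <- x[end] / x[start] - 1     # relative change over the streak
    keep <- !is.na(s$state) & len <= max_len & abs(move) >= min_move
    cbind(s[keep, , drop = FALSE], len = len[keep], move = move[keep])
}

# Toy input that mimics the assumed layout of the streaks() result
x_toy <- c(100, 101, 103, 106, 105, 104, 104, 103, 102, 101, 100)
s_toy <- data.frame(start = c(1, 4), end = c(4, 11),
                    state = c("up", "down"))
filter_streaks(s_toy, x_toy, max_len = 5, min_move = 0.05)
```

With real data, `s` would be the result of `streaks(x, ...)` and `x` the same price vector that was passed to it.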

Enrico Schumann
  • Thanks a lot for the effort! That looks really promising, I will have a look and report after I tested it on my real dataset – zonfl May 09 '19 at 10:33
  • This seems to be exactly what I'm looking for! Also, I'm glad to see that it is very robust to noise. However, the results are still random: https://imgur.com/a/viIHWco (not working), https://imgur.com/a/I1xm5zX (working). My real input data is on heterogeneous scales, hence the results are not always as expected. Here are the current prices of some pairs: `0.5; 10; 60; 180; 1400; 5200.` I would like to rescale them, but that distorts the returns (a jump from 5200 to 5400 is not equal to a jump from 99 to 101). – zonfl May 09 '19 at 17:32
  • Have you adjusted the up/down parameters? They should reflect the volatility of the underlying series. – Enrico Schumann May 10 '19 at 08:07
  • I didn't put the thresholds low enough, now it's working. Checked out the rest of your package as well, really useful and great documentation! Thanks again, your solution saved me a lot of time – zonfl May 10 '19 at 10:25
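
One way to make the thresholds reflect the volatility of each series, as suggested in the comments, is to derive them from the standard deviation of the series' own returns. The sketch below is an assumption, not something taken from PMwR: the helper `vol_threshold` and the multiplier `k` are made up for illustration, and `k` would need tuning on real data.

```r
# Hypothetical helper: derive a streak threshold from a series' own volatility.
# k is an arbitrary multiplier chosen for illustration.
vol_threshold <- function(p, k = 5) {
    r <- diff(p) / head(p, -1)   # simple one-step returns
    k * sd(r)
}

set.seed(1)
steps <- cumsum(sample(c(-1, 1), 1000, TRUE))
small <- 100 + steps             # price level around 100
large <- 52 * small              # same path at a level around 5200

th_small <- vol_threshold(small)
th_large <- vol_threshold(large)
all.equal(th_small, th_large)    # returns ignore multiplicative scale
```

The resulting value could then be passed per series, for example as `streaks(x, up = th, down = -th)`. Because returns do not change when a series is multiplied by a constant, pairs at very different price levels need no rescaling; only their volatility matters for the threshold.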