Detecting areas in a Python dataset

Question

I'm trying to solve a problem that shouldn't be too difficult, but I'm having a very hard time figuring out an approach.

Basically, I have a set of OHLC data:

>>> print(df)

                       Open    High     Low   Close       Volume                Date
Date
2020-11-02 00:00:00  396.68  401.01  396.44  400.70  41468.48318 2020-11-02 00:00:00
2020-11-02 00:30:00  400.68  404.50  400.61  402.45  35209.25068 2020-11-02 00:30:00
2020-11-02 01:00:00  402.48  403.14  400.62  401.89  18107.53656 2020-11-02 01:00:00
2020-11-02 01:30:00  401.88  402.88  401.26  402.48  13852.17215 2020-11-02 01:30:00
2020-11-02 02:00:00  402.49  403.85  398.82  401.17  21853.35028 2020-11-02 02:00:00
...                     ...     ...     ...     ...          ...                 ...
2020-11-04 19:30:00  401.88  403.88  401.88  402.46  17944.49509 2020-11-04 19:30:00
2020-11-04 20:00:00  402.50  404.23  397.72  399.59  41674.44864 2020-11-04 20:00:00
2020-11-04 20:30:00  399.60  402.26  399.40  401.21  18606.38545 2020-11-04 20:30:00
2020-11-04 21:00:00  401.20  403.15  400.79  402.70  14408.66482 2020-11-04 21:00:00
2020-11-04 21:30:00  402.69  403.01  401.74  402.71   8873.15569 2020-11-04 21:30:00

Given a fixed range of width 10 (so from 350 to 360, 351 to 361 and so on), I need to detect when more than N candles closed inside that range. So this range needs to "slide" through the whole chart and find the zones that meet this criterion.

Here is a visual example:

In this case, there are 6 candles closing in the white box, so it's what I'm looking for. Note that the candle must not go through the box; it only needs to "start" there.

I tried to make this as clear and detailed as possible. I would like to post more code, but I'm really struggling to find an approach, even though I'm pretty sure it should be easy with pandas, NumPy or SciPy. Can anyone point me in the right direction? Any kind of advice is welcome.

Does this answer your question? [How to select rows in a DataFrame between two values, in Python Pandas?](https://stackoverflow.com/questions/31617845/how-to-select-rows-in-a-dataframe-between-two-values-in-python-pandas) – Chris Nov 04 '20 at 22:22
That might help, but in my case it's a little bit more complicated: I need to detect all the "zones" that meet the criteria I described above. Thank you! – JayK23 Nov 04 '20 at 22:32

score 1 · Answer 1 · edited Nov 04 '20 at 22:30

Your description is a little vague, but maybe this will help:

Say you have the starting points in a NumPy array called start. You can find the places where these points are between 350 and 360 with:

np.where((start > 350) & (start < 360))

To see how many points these are, do:

len(np.where((start > 350) & (start < 360))[0])
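As a minimal sketch of the above, assuming a small made-up array of prices (the values in `start` below are hypothetical, not taken from the question's data): the boolean mask can be summed directly, which gives the same count as taking the length of the `np.where` result.

```python
import numpy as np

# Hypothetical prices, for illustration only
start = np.array([349.0, 350.0, 352.5, 359.9, 360.0, 361.2, 355.0])

# Strict bounds: 350 and 360 themselves are excluded
inside = (start > 350) & (start < 360)

count = int(inside.sum())            # same value as len(np.where(inside)[0])
positions = np.flatnonzero(inside)   # indices of the points inside the range

print(count)
print(positions)
```

If the end points should count as inside the range, `>=` and `<=` can be used instead.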

answered Nov 04 '20 at 22:24 by Bas · edited Nov 04 '20 at 22:30 by Dharman

Thank you for your answer! Can you tell me which part was vague, so that I can improve the question? This is a start! The problem is that I need to search for those areas in the whole dataset, so the range needs to "shift" through the whole set of data, for example: 350-360, 351-361 and so on. I hope I was not too confusing – JayK23 Nov 04 '20 at 22:31

score 1 · Answer 2 · answered Nov 04 '20 at 23:18

I would suggest that you add a loop to your code. It would be something like this:

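mini = int(df['Close'].min())       # range() needs integers, so truncate the floats
maxi = int(df['Close'].max()) + 1   # round up so the highest close is still covered

candles = []
for i in range(mini, max(mini, maxi - 10) + 1):
    # number of candles that closed inside [i, i + 10]
    n = int(df['Close'].between(i, i + 10).sum())
    if n >= 6:
        candles.append((i, i + 10, n))   # record the range itself, not (mini, maxi)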

Could you please try this on your DataFrame and tell me if it works?
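If the loop turns out to be slow on a large DataFrame, one possible vectorised variant is sketched below. It sorts the closes once and uses np.searchsorted to count how many fall inside each [low, low + width] window. The function name count_closes_per_range, its parameters and the threshold min_count are assumptions made for illustration; the sample closes are copied from the rows shown in the question.

```python
import numpy as np
import pandas as pd


def count_closes_per_range(close, width=10.0, step=1.0):
    """Count closes inside [low, low + width] for each sliding lower bound."""
    values = np.sort(np.asarray(close, dtype=float))
    # Candidate lower bounds, from just below the lowest close to just above the highest
    lows = np.arange(np.floor(values[0]), np.ceil(values[-1]) + step, step)
    # In the sorted array, the closes inside a window form one contiguous slice
    first = np.searchsorted(values, lows, side="left")
    last = np.searchsorted(values, lows + width, side="right")
    return pd.DataFrame({"low": lows, "high": lows + width, "n": last - first})


# Closes taken from the ten rows printed in the question
close = [400.70, 402.45, 401.89, 402.48, 401.17,
         402.46, 399.59, 401.21, 402.70, 402.71]

min_count = 6  # hypothetical threshold N
result = count_closes_per_range(close)
zones = result[result["n"] >= min_count]
print(zones)
```

Note that this counts every close in a range regardless of when it happened, so it does not check whether the candles are next to each other in time.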

answered Nov 04 '20 at 23:18 by Yassine Majdoub

Thank you a lot! I tried this but I got the following error: TypeError: 'numpy.float64' object cannot be interpreted as an integer – JayK23 Nov 05 '20 at 08:55
range(mini, maxi-10) -> range(int(mini), int(maxi-10)) – Bas Nov 05 '20 at 11:49
TypeError: 'numpy.float64' object cannot be interpreted as an integer – JayK23 Nov 05 '20 at 17:45
You can change `mini = df['Close'].min()` into `mini = int(df['Close'].min())`, and similarly for maxi. – Yassine Majdoub Nov 07 '20 at 17:58

tom10 · Accepted Answer · 2020-11-05T17:51:58.990

You can find regions in NumPy by: 1) making an integer array of 1s and 0s (True/False) that marks the points in the region; 2) finding where the steps are (into and out of the region) by subtracting neighboring points; 3) using np.nonzero to find the indices of the boundaries from step 2.

Here's an example (the green band in the final figure marks the region identified only by the two indices returned from nonzero):

import matplotlib.pyplot as plt
import numpy as np

# make some data
dmin, dmax = 0.3, 0.7
x = np.linspace(0, 100, 300)
data = 1 - 1/(1+np.exp(-(x-70)/2))

# do the three steps above:
region = ((data>dmin) & (data<dmax)).astype(int)  # mark region with 1s and 0s
boundaries = region[1:] - region[:-1]  # calculate the boundaries to 1s and -1s corresponding to "into" and "out of", or use np.diff
indices = np.nonzero(boundaries)   # find the indices of the boundary points

fig, axs = plt.subplots(3, 1)
axs[0].plot(x, data)
axs[1].plot(x, region)
axs[2].plot(x[1:], boundaries)
axs[2].axvspan(x[indices[0][0]], x[indices[0][1]], facecolor='g', alpha=0.2)

To find multiple regions longer than a certain length, loop through the list of boundary indices to build a list of boundary pairs. This is mostly a matter of bookkeeping and worrying about the end points (e.g., what happens if the data starts inside a region).

Here's an example that does this. The two main changes are that: 1) I split the boundaries into separate starts and stops index lists; and 2) I calculate large_rois.

dmin, dmax = 0.3, 1000  # just look for being above a min: for multiple regions, make some data that oscillates and this is easier to visualize
minL = 10

# make up some data
x = np.linspace(0, 98.5, 600)  # 98.5 so data ends in a region of interest, which is a case I wanted to check for
data0 = 1-np.exp(-(x-50)**2/400.)
data = 0.5 + 0.5*np.sin((1+1*(data0+1))*x)

rois = ((data>dmin) & (data<dmax)).astype(int) # roi = "region of interest"
boundaries = rois[1:] - rois[:-1]
starts = list(np.nonzero(boundaries>0)[0])  # starting points of roi, and make a list for easy insertion
stops = list(np.nonzero(boundaries<0)[0])   # stopping points of roi, and make a list for easy appending

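if stops and (not starts or stops[0] < starts[0]):  # if data starts in a roi, fix it
    starts.insert(0, 0)

if starts and (not stops or starts[-1] > stops[-1]):  # if data stops in a roi, fix it
    stops.append(len(data) - 1)  # last valid index, so x[stop] stays in bounds when plotting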

large_rois = [(start, stop) for (start, stop) in zip(starts, stops) if stop-start > minL]

print(large_rois)

fig, axs = plt.subplots(3, 1)
axs[0].plot(x, data)
axs[1].plot(x, rois)
axs[2].plot(x[1:], boundaries)
for (start, stop) in large_rois:
    axs[2].axvspan(x[start], x[stop], facecolor='r', alpha=0.4)

Also note that I loop through a list here. When using pandas and NumPy it's generally best to avoid such loops, but in this case the loop doesn't run over all of the data, just over the list of endpoints, which is much shorter than the raw data.

Finally, note that, as with all problems where you're trying to find a region in discrete data, there's a question about how to handle the boundaries, so if this matters, be sure to work it out for your case.
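One way to sidestep the end-point bookkeeping is to pad the mask with a zero on each side before differencing, so that every region gets exactly one start and one stop. The sketch below is only an illustration under stated assumptions, not the code above: find_runs and min_len are hypothetical names, the prices are made up, stop indices are exclusive, and a run is kept when its length is at least min_len (the code above uses a strict > minL).

```python
import numpy as np


def find_runs(mask, min_len=1):
    """Return (start, stop) pairs, stop exclusive, for runs of True with length >= min_len."""
    mask = np.asarray(mask, dtype=bool)
    # Zero padding guarantees a rising edge before and a falling edge after every run,
    # even when the data begins or ends inside a region
    padded = np.concatenate(([0], mask.astype(np.int8), [0]))
    steps = np.diff(padded)
    starts = np.flatnonzero(steps == 1)    # index of the first True in each run
    stops = np.flatnonzero(steps == -1)    # index just past the last True in each run
    keep = (stops - starts) >= min_len
    return list(zip(starts[keep].tolist(), stops[keep].tolist()))


# Hypothetical closing prices; the range 400 to 410 is an arbitrary example
closes = np.array([401.0, 402.5, 403.1, 411.0, 402.0, 401.5,
                   402.2, 403.9, 400.1, 395.0, 401.0])
in_range = (closes >= 400) & (closes <= 410)

runs = find_runs(in_range, min_len=3)
print(runs)
```

Because the stops are exclusive, `closes[start:stop]` gives exactly the values of each run, and the first and last runs need no special handling.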

This is very interesting. Thank you a lot! I'm just having trouble understanding how to apply this to my own code. In my case I need to scan the whole dataset and find zones where there are more than N closes, or where a candle "ended" – JayK23 Nov 05 '20 at 08:56
@JayK23: I edited the answer to include the case of multiple regions and selecting them based on their length. – tom10 Nov 05 '20 at 17:37
This is a very interesting approach. NumPy and pandas can really do a lot of things! I'm going to try this code and report back if I have any problems. Thank you a lot! – JayK23 Nov 05 '20 at 17:49
