
I have a big dataframe with trading data, and I am trying to group the data into small clusters. It has a column called Tick containing either 1, -1, or 0. I want a conditional rule like: "if the tick is 1 or -1, count 1; if it is 0, don't add anything, but still keep that trade." Once we reach 1000 (i.e. 1000 occurrences of either 1 or -1, plus all the lines with 0 in between), create a new array and continue this operation.

I read that numpy is much faster than traditional Python loops, but I'm not sure how to do this without loops. I have read quite a few similar Stack Overflow questions and explanations about vectorization, but I have to admit I did not really understand them, maybe because my math level is too low or I did not follow the code in the examples. I also couldn't find an example where data from pandas is converted to numpy. Example of my dataframe

James Chen
    see how to make a [minimal, reproducible example](https://stackoverflow.com/questions/20109391/how-to-make-good-reproducible-pandas-examples), – sammywemmy May 16 '20 at 02:16

2 Answers


You can do this without an explicit for loop, although pandas may still iterate under the hood in compiled code.

import pandas as pd

x = pd.Series([0, 1, -1, 0, 1, 1, 1, -1, 0, 1, 1, -1])  # or x = df['Tick']
x = (x != 0).cumsum()        # running count of non-zero ticks
ix = x.loc[x == 5].index[0]  # 5 for this small example; set to 1000 for your data

Here `ix` is the index of the row where the running count of non-zero ticks first reaches the target (5 in this example, 1000 with your real data).
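Building on this, if you need every chunk rather than just the first boundary, one loop-free way (a sketch; `CHUNK` and the sample series are placeholders for your own values) is to turn the same cumulative count into a group id and let `groupby` do the splitting:

```python
import pandas as pd

x = pd.Series([0, 1, -1, 0, 1, 1, 1, -1, 0, 1, 1, -1])  # or df['Tick']
CHUNK = 5  # set to 1000 for the real data

# running count of non-zero ticks, shifted by one so that each group
# keeps exactly CHUNK non-zero ticks plus the zeros that fall inside it
nonzero_count = (x != 0).cumsum()
group_id = nonzero_count.shift(fill_value=0) // CHUNK

# one array per chunk; the last one holds the leftover ticks
chunks = [g.to_numpy() for _, g in x.groupby(group_id)]
```

With this convention, zeros that come after a chunk's last non-zero tick are assigned to the next chunk; drop the `shift` if you prefer them to stay with the chunk they follow.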

Adelson Araújo

Another way to accomplish this is to use the remainder of the cumulative sum with respect to the target value, so that every time the count reaches a multiple of 1000 a new array is marked to be split off. The downside of this solution is that you still have to iterate over the resulting variable-length arrays.

Random choice array used as input:

[ 1 -1  0 -1 -1  0 -1 -1 -1  0  0 -1  0  0 -1  1  0 -1  1  1  0 -1  0  0
  1 -1  0  1  1 -1  1  0 -1  0 -1 -1  1  1  1 -1 -1  1 -1  0 -1 -1  1 -1
  1  1 -1  0  1  1  1  1  0  0  0 -1  0  0  1  1  1 -1  1  1  0  1  1  1
  0 -1  0  0  0  1 -1  1 -1  1  1  1  0  0 -1  1  0 -1  1  0  1  0 -1 -1
  0  0  0  1]
import pandas as pd
import numpy as np

TARGET_SUM = 5  # change to 1000 here

pds = pd.Series(np.random.choice([-1, 0, 1], size=100))
x = (pds != 0).cumsum()
# first index at which the non-zero count hits each multiple of TARGET_SUM
ix = x.loc[(x % TARGET_SUM == 0) & (x > 0)].drop_duplicates().index

ix += 1             # split right after each boundary row
ix = ix.insert(0, 0)

rows = []
for idx in range(len(ix) - 1):
    new_arr = pds[ix[idx]:ix[idx + 1]].to_list()
    rows.append([new_arr, len(new_arr), np.sum(np.abs(new_arr))])

df = pd.DataFrame(rows, columns=["NEW_ARRAY", "LEN(NEW_ARRAY)", "SUM(ABS(NEW_ARRAY))"])
print(df)
r = len(pds) - df.iloc[:, 1].sum()
print("Unused values from original df: {} - {} = {}".format(len(pds), df.iloc[:, 1].sum(), r))

Output from df

                         NEW_ARRAY  LEN(NEW_ARRAY)  SUM(ABS(NEW_ARRAY))
0         [1, -1, 0, -1, -1, 0, -1]               7                    5
1   [-1, -1, 0, 0, -1, 0, 0, -1, 1]               9                    5
2     [0, -1, 1, 1, 0, -1, 0, 0, 1]               9                    5
3              [-1, 0, 1, 1, -1, 1]               6                    5
4          [0, -1, 0, -1, -1, 1, 1]               7                    5
5                [1, -1, -1, 1, -1]               5                    5
6             [0, -1, -1, 1, -1, 1]               6                    5
7               [1, -1, 0, 1, 1, 1]               6                    5
8   [1, 0, 0, 0, -1, 0, 0, 1, 1, 1]              10                    5
9               [-1, 1, 1, 0, 1, 1]               6                    5
10    [1, 0, -1, 0, 0, 0, 1, -1, 1]               9                    5
11          [-1, 1, 1, 1, 0, 0, -1]               7                    5
12       [1, 0, -1, 1, 0, 1, 0, -1]               8                    5

Unused values from original df: 100 - 95 = 5
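The per-chunk Python loop can also be avoided by handing the same boundary indices to `np.split`, which cuts the underlying array in one call (a sketch, with a small hand-made series standing in for the random input):

```python
import numpy as np
import pandas as pd

TARGET_SUM = 5
pds = pd.Series([1, -1, 0, -1, -1, 0, -1, -1, -1, 0, 0, -1])

x = (pds != 0).cumsum()
# index right after each row where the non-zero count hits a multiple of TARGET_SUM
bounds = x.loc[(x % TARGET_SUM == 0) & (x > 0)].drop_duplicates().index + 1
chunks = np.split(pds.to_numpy(), bounds)
```

Unlike the loop version, the trailing incomplete chunk is not discarded here; it comes back as the last element of `chunks`.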
n1colas.m