1

I have a pandas dataframe where I want to loop over its rows and calculate a metric starting with from first row to 2nd, if not found there, check from first row to 3rd row, 4th row etc. and compare this metric with another value. I want to get the row number which the condition is first met. To give a concrete example, for a dataframe with length 30, it might be from df.iloc[0:10] df.iloc[10:15] and df.iloc[15:27], df.iloc[27:30] , where values 10, 15, 27 are stored in a list.

An example dataframe:

import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.randint(0,100, size=(100, 1)), columns=list('A'))
df  
    A
0   5
1  11
2   8
3   1
4  16
5  24
some_value = 20 
mylist = []
for i in range(len(df)):
    for j in range(i+2, range(len(df)):
        # Metric calculated on the relevant rows
        metric = df.iloc[i:j]['A'].sum()
        if metric >= some_value:
           mylist.append(j)
           break

The loop starts with df.iloc[0:2], and calculates 5+11, since it is not greater than the some_value (20), it passes to df.iloc[0:3]. This time, since 5+11+8 is greater than some_value, I want to save this number (2) and don't check df.iloc[0:4]. Then the loop should start checking again starting from df.iloc[3:5] this time (1+16), since the condition is not met, continue with df.iloc[3:6] (1+16+24) and so on and save the points when the condition is met.

Example output for this case is a list with values: [2, 5]

I wrote the code above but couldn't fully achieve what I want.Could you help about this issue? Thanks.

rpanai
  • 12,515
  • 2
  • 42
  • 64
Solijoli
  • 464
  • 2
  • 8
  • 19
  • 2
    please provide sample start DataFrame and sample final-output DataFrame. check this post out on how to post a reproducible question: https://stackoverflow.com/questions/20109391/how-to-make-good-reproducible-pandas-examples – Ukrainian-serge Sep 23 '20 at 17:41
  • There may be a much better algorithm than checking every (i, j). You might be able to make a moving window by advancing i and j as needed, or use an accumulator. It's hard to tell exactly what details you want, and I don't think you really need a dataframe. (edit -- ah you just edited, I'll reread) – Kenny Ostrom Sep 23 '20 at 18:05
  • wouldn't that be [2, 5]? The condition is first met at [5 + 11 + 8] which happens at index 2 – Kenny Ostrom Sep 23 '20 at 18:12
  • Yes you are right, corrected it, thank you. – Solijoli Sep 23 '20 at 18:14

3 Answers3

0

Currently, your loop is O(n^2). But once you find a match for your starting value of i, your outer loop has to start over at i+1, and you don't want to start there. You want to start at j. Here's a quick fix to your code.

I don't have numpy at the moment, so I am using a python list as the data.

data = [5, 11, 8, 1, 16, 24]
some_value = 20 
mylist = []
j = 0
for i in range(len(data)):
    # can't change iteration so just skip ahead with continue
    if i < j:
        continue
    # range expects second argument to be past the end
    # dunno if df is the same, but probably?
    for j in range(i+1, len(data)+1):
        metric = sum(data[i:j])
        if metric >= some_value:
            mylist.append(j-1)
            break
print(mylist)

[2, 5]

I'd suggest doing this in one loop, and keep a running total (accumulator). Here I get a little fancy with returning the range, in case you want to splice the df:

data = [5, 11, 8, 1, 16, 24]
threshold = 20

def accumulate_to_threshold(data, threshold):
    start = 0
    total = 0
    for index, item in enumerate(data):
        total += item
        if total > threshold:
            yield (start, index+1)
            total = 0
            start = index+1
    # leftovers below threshold here

for start, end in accumulate_to_threshold(data, threshold):
    sublist = data[start:end]
    print (sublist, "totals to", sum(sublist))

[5, 11, 8] totals to 24
[1, 16, 24] totals to 41

And of course, instead of yielding a range, you can yield index and get the [2, 5] from above.

Kenny Ostrom
  • 5,639
  • 2
  • 21
  • 30
0

My approach would be to:

I don't know if this will answer your question the way you want, but I will show how my brain approached it using the built-in vectorization of pandas/numpy, in short looping is a hassle(slow) and should be avoided if possible:

import pandas as pd
import numpy as np

# made it smaller
df = pd.DataFrame(np.random.randint(0,25, size=(20, 1)), columns=list('A'))

numpy.reshape() and sum()

We'll reshape col A, which moves the values to be side by side, then sum going across axis=1:

Compare df to re_shaped below. Notice how the values have been rearranged


re_shaped = np.reshape(df.A.values, (10, 2))
print(df)

     A
0    5
1   11
2    8
3   23
...
16   6
17  14
18   3
19   0

print(re_shaped)

array([[ 5, 11],
       [ 8, 23],
       ...
       [ 6, 14],
       [ 3,  0]])

summed = re_shaped.sum(axis=1)
print(summed)

array([16, 31, 15, 19, 13, 21, 28, 30, 20,  3])

boolean mask

some_value = 20
greater_than_some_value = summed[summed >= some_value]
print(greater_than_some_value)

array([31, 21, 28, 30, 20])

And there you have it. Hope it helped.

Ukrainian-serge
  • 854
  • 7
  • 12
0

Have you considered to use one loop only:

import pandas as pd
import numpy as np

n = int(1e6)
df = pd.DataFrame({"A": np.random.randint(100, size=n)})

threshold = 20
my_list = []
s = 0
for i, k in enumerate(df["A"].values):
    if s + k > threshold:
        my_list.append(i)
        s = 0
    else:
        s += k

You can eventually use numba but I think the best think is to calculate a cumsum with reset in your df.

Numba

The previous could be write as a function

def fun(vec, threshold=20):
    my_list = []
    s = 0
    for i, k in enumerate(vec):
        if s + k > threshold:
            my_list.append(i)
            s = 0
        else:
            s += k
    return my_list

And we can use numba

from numba import jit

@jit(nopython=True, cache=True, nogil=True)
def fun_numba(vec, threshold=20):
    my_list = []
    s = 0
    for i, k in enumerate(vec):
        if s + k > threshold:
            my_list.append(i)
            s = 0
        else:
            s += k
    return my_list
%%timeit -n 5 -r 5
my_list = fun(df["A"].values)
606 ms ± 28 ms per loop (mean ± std. dev. of 5 runs, 5 loops each)
%%timeit -n 5 -r 5
my_list = fun_numba(df["A"].values)
59.6 ms ± 20.4 ms per loop (mean ± std. dev. of 5 runs, 5 loops each)

which is ~10x speedup.

rpanai
  • 12,515
  • 2
  • 42
  • 64