Python: Iterate over a data frame column, check for a condition-value stored in array, and get the values to a list

Question

After some help in the forum I managed to do what I was looking for, and now I need to take it to the next level. (The long explanation is here: Python Data Frame: cumulative sum of column until condition is reached and return the index.)

I have a data frame:

In [3]: df
Out[3]: 
   index  Num_Albums  Num_authors
0      0          10            4
1      1           1            5
2      2           4            4
3      3           7         1000
4      4           1           44
5      5           3            8

I add a column with the cumulative sum of another column.

In [4]: df['cumsum'] = df['Num_Albums'].cumsum()

In [5]: df
Out[5]: 
   index  Num_Albums  Num_authors  cumsum
0      0          10            4      10
1      1           1            5      11
2      2           4            4      15
3      3           7         1000      22
4      4           1           44      23
5      5           3            8      26

Then I apply a condition to the cumsum column and extract the rows where the condition is met within a given tolerance:

In [18]: tol = 2

In [19]: cond = df.where((df['cumsum']>=15-tol)&(df['cumsum']<=15+tol)).dropna()

In [20]: cond
Out[20]: 
   index  Num_Albums  Num_authors  cumsum
2    2.0         4.0          4.0    15.0

Now I want to replace the single condition (15 in the example) with conditions stored in an array. For each condition, I want to check where it is met and retrieve not the entire row but only the value of the column Num_Albums. Finally, all the retrieved values (one per condition) should be stored in an array or list. Coming from MATLAB, I would do something like this (I apologize for the mixed MATLAB/Python syntax):

conditions = np.array([10, 15, 23])
for i=0:len(conditions)
   retrieved_values(i) = df.where((df['cumsum']>=conditions(i)-tol)&(df['cumsum']<=conditions(i)+tol)).dropna()

So for the data frame above I would get (for tol=0):

retrieved_values = [10, 4, 1]

I would like a solution that lets me keep the .where function, if possible.
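For reference, a minimal sketch of how the loop above might look in valid Python while keeping .where. The data frame is rebuilt from the table in the question; taking the first matching row and using None when nothing is within tolerance are assumptions, not part of the original question.

```python
import numpy as np
import pandas as pd

# Data frame rebuilt from the table shown in the question
df = pd.DataFrame({
    "Num_Albums": [10, 1, 4, 7, 1, 3],
    "Num_authors": [4, 5, 4, 1000, 44, 8],
})
df["cumsum"] = df["Num_Albums"].cumsum()

conditions = np.array([10, 15, 23])
tol = 0

retrieved_values = []
for c in conditions:
    in_range = (df["cumsum"] >= c - tol) & (df["cumsum"] <= c + tol)
    # .where turns non-matching rows into NaN; dropna then removes them
    matched = df.where(in_range).dropna()["Num_Albums"]
    # Assumption: keep the first match, or None if no row is within tolerance
    retrieved_values.append(int(matched.iloc[0]) if len(matched) else None)

print(retrieved_values)  # [10, 4, 1]
```

The int(...) cast is there because .where introduces NaN and therefore converts the column to float.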

Po Stevanus Andrianta · Answer 1 · 2017-01-09T14:24:57.020

2

Well, the output will not always be exactly one number, right? If the output is exactly one number per condition, you can write this code:

tol = 0
# condition values to look for
c = [5, 15, 25]
value = []

for i in c:
    # values of column 'a' that fall within [i - tol, i + tol]
    matches = df.where((df['a'] >= i - tol) & (df['a'] <= i + tol)).dropna()['a']
    if len(matches) > 0:
        # keep only the first matching value
        value.append(matches.values[0])
    else:
        # no match for this condition: store an empty placeholder
        value.append([])
print(value)

The output should look like

[1,2,3]

If the output can contain multiple numbers per condition and you want it to look like this

[[1.0, 5.0], [12.0, 15.0], [25.0]]

you can use this code

tol = 5
c = [5,15,25]
value = []

for i in c:
    getdatas = df.where((df['a'] >= i - tol) & (df['a'] <= i + tol)).dropna()['a'].values
    value.append([x for x in getdatas])
print(value)

edited Jan 09 '17 at 14:24

answered Jan 09 '17 at 10:53

Po Stevanus Andrianta


I keep getting this: IndexError: index 0 is out of bounds for axis 0 with size 0 – AMaz Jan 09 '17 at 12:54
@AMaz The first option or the second? The first raises an IndexError because it takes .values[0], so the result needs to be validated beforehand. Let me edit it for you. – Po Stevanus Andrianta Jan 09 '17 at 14:21

score 2 · Answer 2 · edited May 23 '17 at 12:33

A quick way would be to leverage NumPy broadcasting, as an extension of this answer from the same linked post, although the question actually asked for an answer using DF.where.

Broadcasting eliminates the need to iterate through every element of the array and it's highly efficient at the same time.

The only addition to that answer is the use of np.argmax to grab the index of the first True instance along each column (traversing in the ↓ direction).

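import numpy as np

conditions = np.array([10, 15, 23])
tol = 0
num_albums = df.Num_Albums.values
num_albums_cumsum = df.Num_Albums.cumsum().values
# compare every cumsum value (rows) against every condition (columns),
# then take the index of the first True in each column
slices = np.argmax(np.isclose(num_albums_cumsum[:, None], conditions, atol=tol), axis=0)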

Retrieved slices:

slices
Out[692]:
array([0, 2, 4], dtype=int64)

Corresponding array produced:

num_albums[slices]
Out[693]:
array([10,  4,  1], dtype=int64)
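To make the broadcasting step easier to follow, here is a small self-contained sketch that prints the intermediate boolean matrix. The arrays are rebuilt from the question's data, and the name mask is introduced here only for illustration.

```python
import numpy as np

num_albums = np.array([10, 1, 4, 7, 1, 3])
num_albums_cumsum = num_albums.cumsum()   # 10, 11, 15, 22, 23, 26
conditions = np.array([10, 15, 23])

# A (6, 1) array compared against a (3,) array broadcasts to a (6, 3) matrix:
# entry [i, j] is True when cumsum[i] matches conditions[j]
mask = np.isclose(num_albums_cumsum[:, None], conditions, atol=0)
print(mask.shape)        # (6, 3)
print(mask.astype(int))

# argmax along axis 0 gives the row of the first True in each column
slices = mask.argmax(axis=0)
print(slices)
print(num_albums[slices])
```

Each column of mask corresponds to one condition, so one argmax call handles all conditions at once.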

If you still prefer using DF.where, here is another solution using a list comprehension:

[df.where((df['cumsum'] >= cond - tol) & (df['cumsum'] <= cond + tol), -1)['Num_Albums']
   .max() for cond in conditions]
Out[695]:
[10, 4, 1]

Rows that do not fulfil the given criteria are replaced by -1. Doing it this way preserves the dtype at the end.
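A short sketch of the dtype point, using a reduced version of the question's data frame. The variable names and the unmatched value 100 are hypothetical, chosen only to show what happens when no row matches.

```python
import pandas as pd

df = pd.DataFrame({"Num_Albums": [10, 1, 4, 7, 1, 3]})
df["cumsum"] = df["Num_Albums"].cumsum()

in_range = df["cumsum"] == 15

# Default fill value is NaN, which forces the integer column to float
nan_dtype = df.where(in_range)["Num_Albums"].dtype
print(nan_dtype)      # float64

# Filling with -1 instead keeps the integer dtype
filled_dtype = df.where(in_range, -1)["Num_Albums"].dtype
print(filled_dtype)   # int64

# Caveat: if no row matches, max() returns the sentinel -1 itself
no_match = df["cumsum"] == 100   # hypothetical condition with no match
sentinel = df.where(no_match, -1)["Num_Albums"].max()
print(sentinel)       # -1
```

The -1 sentinel only works as a "no match" marker if -1 can never be a genuine value in the column, and taking .max() assumes the matching values are larger than the sentinel.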

I actually prefer the first option, but I am not sure the use of "None" is clear to me. With your suggestion, when a condition is not met, "slices" takes the value 0, so "num_albums[slices]" returns the first value (at index 0) for every position where the condition is not met. How can I have "slices" be NaN when the condition is not met? – AMaz, Jan 09 '17 at 14:05
`None` here means `np.newaxis`, which in simple terms reshapes the array by inserting an additional dimension, so that it can be compared as a 2-D array. `num_albums_cumsum.reshape(-1, 1)` works for the same purpose. No, `num_albums[slices]` gives you the value for which the condition is satisfied. If you want `NaN`s to appear for the `False` condition, I would suggest considering `np.where` instead. But I don't understand its implications here, as you just want to grab the values in a list/array. – Nickil Maveli, Jan 09 '17 at 14:19
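One possible way to follow the np.where suggestion is sketched below. It is an illustration rather than code from the answer: the names mask, first_hit and has_hit are made up here, and the condition 100 is a hypothetical value that matches nothing.

```python
import numpy as np

num_albums = np.array([10, 1, 4, 7, 1, 3])
num_albums_cumsum = num_albums.cumsum()
conditions = np.array([10, 15, 100])   # 100 is a hypothetical unmatched value
tol = 0

mask = np.isclose(num_albums_cumsum[:, None], conditions, atol=tol)
first_hit = mask.argmax(axis=0)   # also 0 for columns that contain no True
has_hit = mask.any(axis=0)        # tells real matches apart from "no match"

# Keep the retrieved value where a match exists, NaN otherwise
result = np.where(has_hit, num_albums[first_hit], np.nan)
print(result)
```

Because NaN is a float, the resulting array is float-valued even though num_albums holds integers.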
