
I have the following pandas df:

| Date | GB | US | CA | AU | SG | DE | FR |
| ---- | -- | -- | -- | -- | -- | -- | -- |
| 1    | 25 |    |    |    |    |    |    |
| 2    | 29 |    |    |    |    |    |    |
| 3    | 33 |    |    |    |    |    |    |
| 4    | 31 | 35 |    |    |    |    |    |
| 5    | 30 | 34 |    |    |    |    |    |
| 6    |    | 35 | 34 |    |    |    |    |
| 7    |    | 31 | 26 |    |    |    |    |
| 8    |    | 33 | 25 | 31 |    |    |    |
| 9    |    |    | 26 | 31 |    |    |    |
| 10   |    |    | 27 | 26 | 28 |    |    |
| 11   |    |    | 35 | 25 | 29 |    |    |
| 12   |    |    |    | 33 | 35 | 28 |    |
| 13   |    |    |    | 28 | 25 | 35 |    |
| 14   |    |    |    | 25 | 25 | 28 |    |
| 15   |    |    |    | 25 | 26 | 31 | 25 |
| 16   |    |    |    |    | 26 | 31 | 27 |
| 17   |    |    |    |    | 34 | 29 | 25 |
| 18   |    |    |    |    | 28 | 29 | 31 |
| 19   |    |    |    |    |    | 34 | 26 |
| 20   |    |    |    |    |    | 28 | 30 |
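For reference, the frame above can be reconstructed like this (blank cells are NaN; `Date` is the index):

```python
import numpy as np
import pandas as pd

n = np.nan
rows = [
    [25,  n,  n,  n,  n,  n,  n],
    [29,  n,  n,  n,  n,  n,  n],
    [33,  n,  n,  n,  n,  n,  n],
    [31, 35,  n,  n,  n,  n,  n],
    [30, 34,  n,  n,  n,  n,  n],
    [ n, 35, 34,  n,  n,  n,  n],
    [ n, 31, 26,  n,  n,  n,  n],
    [ n, 33, 25, 31,  n,  n,  n],
    [ n,  n, 26, 31,  n,  n,  n],
    [ n,  n, 27, 26, 28,  n,  n],
    [ n,  n, 35, 25, 29,  n,  n],
    [ n,  n,  n, 33, 35, 28,  n],
    [ n,  n,  n, 28, 25, 35,  n],
    [ n,  n,  n, 25, 25, 28,  n],
    [ n,  n,  n, 25, 26, 31, 25],
    [ n,  n,  n,  n, 26, 31, 27],
    [ n,  n,  n,  n, 34, 29, 25],
    [ n,  n,  n,  n, 28, 29, 31],
    [ n,  n,  n,  n,  n, 34, 26],
    [ n,  n,  n,  n,  n, 28, 30],
]
df = pd.DataFrame(rows,
                  columns=["GB", "US", "CA", "AU", "SG", "DE", "FR"],
                  index=pd.RangeIndex(1, 21, name="Date"))
```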

I have partly accomplished what I am trying to do using pandas alone, but the process takes ages, so I am switching to numpy (see Getting the nearest values to the left in a pandas column), and that is where I am struggling.

Essentially, I want a function f, taking an integer offset argument, to capture the first non-NaN value in each row (counting from the left) and return the whole thing as a numpy array/vector, so that:

f(offset=0)


| 0  | 1  |
| -- | -- |
| 1  | 25 |
| 2  | 29 |
| 3  | 33 |
| 4  | 31 |
| 5  | 30 |
| 6  | 35 |
| 7  | 31 |
| 8  | 33 |
| 9  | 26 |
| 10 | 27 |
| 11 | 35 |
| 12 | 33 |
| 13 | 28 |
| 14 | 25 |
| 15 | 25 |
| 16 | 26 |
| 17 | 34 |
| 18 | 28 |
| 19 | 34 |
| 20 | 28 |

As I have described in the other post, it's best to imagine a horizontal line being drawn from the left for each row, and returning the values intersected by that line as an array. offset=0 then returns the first value in that array, offset=1 returns the second value intersected, and so on.

Therefore:

f(offset=1)

| 0  | 1   |
| -- | --- |
| 1  | nan |
| 2  | nan |
| 3  | nan |
| 4  | 35  |
| 5  | 34  |
| 6  | 34  |
| 7  | 26  |
| 8  | 25  |
| 9  | 31  |
| 10 | 26  |
| 11 | 25  |
| 12 | 35  |
| 13 | 25  |
| 14 | 25  |
| 15 | 26  |
| 16 | 31  |
| 17 | 29  |
| 18 | 29  |
| 19 | 26  |
| 20 | 30  |

The pandas solution proposed in the post above is very effective:

def f(df, offset=0):
    # Push the NaNs in each row to the end, then take the value at position `offset`
    x = df.apply(lambda row: sorted(row, key=pd.isna)[offset], axis=1)
    return x

print(f(df, 1))

However, this is very slow over many iterations on larger data. I have tried np.apply_along_axis and it's even slower!

Is there a faster way with numpy vectorization?

Many thanks.

RebeccaKennedy

2 Answers

Numpy approach

We can define a function first_valid which takes a 2D array and an offset n as input arguments and returns a 1D array. Basically, for each row it returns the value n positions to the right of that row's first non-NaN value.

def first_valid(arr, offset=0):
    m = ~np.isnan(arr)                    # mask of valid (non-NaN) cells
    i = m.argmax(axis=1) + offset         # column of first valid cell per row, shifted by offset
    iy = np.clip(i, 0, arr.shape[1] - 1)  # keep the indices in bounds

    vals = arr[np.r_[:arr.shape[0]], iy]  # pick one value per row
    # rows with no valid cell at all, or whose shifted index runs off the end, get NaN
    vals[(~m.any(1)) | (i >= arr.shape[1])] = np.nan
    return vals
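A quick sanity check on a tiny array (the function is repeated here so the snippet runs on its own):

```python
import numpy as np

def first_valid(arr, offset=0):
    m = ~np.isnan(arr)
    i = m.argmax(axis=1) + offset
    iy = np.clip(i, 0, arr.shape[1] - 1)
    vals = arr[np.r_[:arr.shape[0]], iy]
    vals[(~m.any(1)) | (i >= arr.shape[1])] = np.nan
    return vals

arr = np.array([
    [np.nan, 1.0, 2.0],         # first valid value at column 1
    [3.0, np.nan, 4.0],         # first valid value at column 0
    [np.nan, np.nan, np.nan],   # no valid values at all
])
print(first_valid(arr, 0))  # [ 1.  3. nan]
print(first_valid(arr, 2))  # [nan  4. nan] -- offset 2 runs off the end of row 0
```

Note that the offset is positional: for row 1, offset=1 lands on the NaN at column 1 rather than skipping ahead to 4.0.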

Pandas approach

We can stack the dataframe to reshape it, group the result on level=0, aggregate using nth, and then reindex so the aggregated frame's index conforms to the original frame's:

def first_valid(df, offset=0):
    return df.stack().groupby(level=0)\
                     .nth(offset).reindex(df.index)

Sample run

>>> first_valid(df, 0)
Date
1     25.0
2     29.0
3     33.0
4     31.0
5     30.0
6     35.0
7     31.0
8     33.0
9     26.0
10    27.0
11    35.0
12    33.0
13    28.0
14    25.0
15    25.0
16    26.0
17    34.0
18    28.0
19    34.0
20    28.0
dtype: float64


>>> first_valid(df, 1)
Date
1      NaN
2      NaN
3      NaN
4     35.0
5     34.0
6     34.0
7     26.0
8     25.0
9     31.0
10    26.0
11    25.0
12    35.0
13    25.0
14    25.0
15    26.0
16    31.0
17    29.0
18    29.0
19    26.0
20    30.0
dtype: float64

>>> first_valid(df, 2)
Date
1      NaN
2      NaN
3      NaN
4      NaN
5      NaN
6      NaN
7      NaN
8     31.0
9      NaN
10    28.0
11    29.0
12    28.0
13    35.0
14    28.0
15    31.0
16    27.0
17    25.0
18    31.0
19     NaN
20     NaN
dtype: float64

Performance

# Sample dataframe for testing purpose
df_test = pd.concat([df] * 10000, ignore_index=True)

%%timeit # Numpy approach
_ = first_valid(df_test.to_numpy(), 1)
# 6.9 ms ± 212 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)


%%timeit # Pandas approach
_ = first_valid(df_test, 1)
# 90 ms ± 867 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)


%%timeit # OP's approach
_ = f(df_test, 1)
# 2.03 s ± 183 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

The numpy-based approach is approximately 300x faster than the OP's given approach, while the pandas-based approach is approximately 22x faster.

Shubham Sharma
  • This is very fast indeed!! It would be very good if I could compile it with Numba as well, which only accepts numpy arrays. Is there a way to remove pandas altogether from this? – RebeccaKennedy Jun 27 '21 at 12:22
  • @jazz090 I'll have to think it over. I'll get back to you if there is a viable solution using numba. – Shubham Sharma Jun 27 '21 at 12:26
  • Many thanks. This has improved things substantially! – RebeccaKennedy Jun 27 '21 at 12:27
  • Ultimately what I am trying to do is calculate the difference between different offset values: `first_valid(df, 1) - first_valid(df, 0)` and `first_valid(df, 1) - first_valid(df, 2)` and so on. I have a loop at present which is iterating the different solutions as `first_valid(df, i+1).values - first_valid(df, i).values`, and that is taking around 1 minute for `i<30`. – RebeccaKennedy Jun 27 '21 at 12:38
  • @jazz090 I've added alternative pure `numpy` based approach. – Shubham Sharma Jun 27 '21 at 13:42
  • Many thanks for your solution. It is very fast!! If you have a look at my output for `f(offset=1)`, particularly the last values, there should be slight differences in the desired output. For offsets >0, the function returns the first element, so for rows 19 and 20 the data selected should come from `FR` and not `DE`, if that makes sense? Imagine drawing a horizontal line from left to right for each row until it intersects with a value. `offset` is then the value for *which* intersection it is, so `offset=0` is the first intersection, `offset=1` is the second intersection and so on for `nth`. – RebeccaKennedy Jun 27 '21 at 14:05
  • Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/234260/discussion-between-shubham-sharma-and-jazz090). – Shubham Sharma Jun 27 '21 at 14:11

This will do it for offset=0, and should be efficient:

df.unstack().groupby('Date').first()

I thought .nth(1) would work for offset=1 and so on, but it doesn't. This is left as an exercise for the reader.
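For instance, on a toy two-row frame (the column names here are just illustrative), the one-liner picks the first non-null value per Date across the columns:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {"GB": [25, np.nan], "US": [np.nan, 34]},
    index=pd.Index([1, 2], name="Date"),
)
# unstack() yields a Series keyed by (column, Date);
# grouping on the Date level and taking first() skips the NaNs
out = df.unstack().groupby("Date").first()
print(out)  # Date 1 -> 25.0 (from GB), Date 2 -> 34.0 (from US)
```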

John Zwinck