
I have the following pandas df:

| Date | GB | US | CA | AU | SG | DE | FR |
| ---- | -- | -- | -- | -- | -- | -- | -- |
| 1    | 25 |    |    |    |    |    |    |
| 2    | 29 |    |    |    |    |    |    |
| 3    | 33 |    |    |    |    |    |    |
| 4    | 31 | 35 |    |    |    |    |    |
| 5    | 30 | 34 |    |    |    |    |    |
| 6    |    | 35 | 34 |    |    |    |    |
| 7    |    | 31 | 26 |    |    |    |    |
| 8    |    | 33 | 25 | 31 |    |    |    |
| 9    |    |    | 26 | 31 |    |    |    |
| 10   |    |    | 27 | 26 | 28 |    |    |
| 11   |    |    | 35 | 25 | 29 |    |    |
| 12   |    |    |    | 33 | 35 | 28 |    |
| 13   |    |    |    | 28 | 25 | 35 |    |
| 14   |    |    |    | 25 | 25 | 28 |    |
| 15   |    |    |    | 25 | 26 | 31 | 25 |
| 16   |    |    |    |    | 26 | 31 | 27 |
| 17   |    |    |    |    | 34 | 29 | 25 |
| 18   |    |    |    |    | 28 | 29 | 31 |
| 19   |    |    |    |    |    | 34 | 26 |
| 20   |    |    |    |    |    | 28 | 30 |
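For reference, the frame above can be reconstructed like this (blank cells are NaN; `Date` is the index):

```python
import numpy as np
import pandas as pd

n = np.nan
rows = [
    [25,  n,  n,  n,  n,  n,  n],
    [29,  n,  n,  n,  n,  n,  n],
    [33,  n,  n,  n,  n,  n,  n],
    [31, 35,  n,  n,  n,  n,  n],
    [30, 34,  n,  n,  n,  n,  n],
    [ n, 35, 34,  n,  n,  n,  n],
    [ n, 31, 26,  n,  n,  n,  n],
    [ n, 33, 25, 31,  n,  n,  n],
    [ n,  n, 26, 31,  n,  n,  n],
    [ n,  n, 27, 26, 28,  n,  n],
    [ n,  n, 35, 25, 29,  n,  n],
    [ n,  n,  n, 33, 35, 28,  n],
    [ n,  n,  n, 28, 25, 35,  n],
    [ n,  n,  n, 25, 25, 28,  n],
    [ n,  n,  n, 25, 26, 31, 25],
    [ n,  n,  n,  n, 26, 31, 27],
    [ n,  n,  n,  n, 34, 29, 25],
    [ n,  n,  n,  n, 28, 29, 31],
    [ n,  n,  n,  n,  n, 34, 26],
    [ n,  n,  n,  n,  n, 28, 30],
]
df = pd.DataFrame(rows,
                  columns=["GB", "US", "CA", "AU", "SG", "DE", "FR"],
                  index=pd.RangeIndex(1, 21, name="Date"))
```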

I have partly accomplished what I am trying to do using pandas alone, but the process takes ages, so I am switching to numpy (see Getting the nearest values to the left in a pandas column), and that is where I am struggling.

Essentially, I want a function f, taking an integer offset argument, to capture the first non-NaN value in each row (counting from the left) and return the whole thing as a numpy array/vector, so that:

f(offset=0)


| 0  | 1  |
| -- | -- |
| 1  | 25 |
| 2  | 29 |
| 3  | 33 |
| 4  | 31 |
| 5  | 30 |
| 6  | 35 |
| 7  | 31 |
| 8  | 33 |
| 9  | 26 |
| 10 | 27 |
| 11 | 35 |
| 12 | 33 |
| 13 | 28 |
| 14 | 25 |
| 15 | 25 |
| 16 | 26 |
| 17 | 34 |
| 18 | 28 |
| 19 | 34 |
| 20 | 28 |

As I have described in the other post, it's best to imagine a horizontal line being drawn from the left for each row, and returning the values intersected by that line as an array. offset=0 then returns the first value in that array, offset=1 returns the second value intersected, and so on.

Therefore:

f(offset=1)

| 0  | 1   |
| -- | --- |
| 1  | nan |
| 2  | nan |
| 3  | nan |
| 4  | 35  |
| 5  | 34  |
| 6  | 34  |
| 7  | 26  |
| 8  | 25  |
| 9  | 31  |
| 10 | 26  |
| 11 | 25  |
| 12 | 35  |
| 13 | 25  |
| 14 | 25  |
| 15 | 26  |
| 16 | 31  |
| 17 | 29  |
| 18 | 29  |
| 19 | 26  |
| 20 | 30  |

The pandas solution proposed in the post above is very effective:

def f(df, offset=0):
    # Push the NaNs in each row to the end, then take the value at position `offset`
    x = df.apply(lambda row: sorted(row, key=pd.isna)[offset], axis=1)
    return x

print(f(df, 1))

However, this is very slow over many iterations on larger data. I have tried np.apply_along_axis and it's even slower!

Is there a faster way with numpy vectorization?

Many thanks.

RebeccaKennedy

2 Answers

Numpy approach

We can define a function first_valid which takes a 2D array and an offset n as input arguments and returns a 1D array. Basically, for each row it returns the value n positions to the right of that row's first non-NaN value.

def first_valid(arr, offset=0):
    m = ~np.isnan(arr)                    # mask of valid (non-NaN) cells
    i = m.argmax(axis=1) + offset         # column of first valid cell per row, shifted by offset
    iy = np.clip(i, 0, arr.shape[1] - 1)  # keep the indices in bounds

    vals = arr[np.r_[:arr.shape[0]], iy]  # pick one value per row
    # rows with no valid cell at all, or whose shifted index runs off the end, get NaN
    vals[(~m.any(1)) | (i >= arr.shape[1])] = np.nan
    return vals
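A quick sanity check on a tiny array (the function is repeated here so the snippet runs on its own):

```python
import numpy as np

def first_valid(arr, offset=0):
    m = ~np.isnan(arr)
    i = m.argmax(axis=1) + offset
    iy = np.clip(i, 0, arr.shape[1] - 1)
    vals = arr[np.r_[:arr.shape[0]], iy]
    vals[(~m.any(1)) | (i >= arr.shape[1])] = np.nan
    return vals

arr = np.array([
    [np.nan, 1.0, 2.0],         # first valid value at column 1
    [3.0, np.nan, 4.0],         # first valid value at column 0
    [np.nan, np.nan, np.nan],   # no valid values at all
])
print(first_valid(arr, 0))  # [ 1.  3. nan]
print(first_valid(arr, 2))  # [nan  4. nan] -- offset 2 runs off the end of row 0
```

Note that the offset is positional: for row 1, offset=1 lands on the NaN at column 1 rather than skipping ahead to 4.0.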

Pandas approach

We can stack the dataframe to reshape it, group the result on level=0, aggregate using nth, and then reindex so the aggregated frame's index conforms to the original frame's:

def first_valid(df, offset=0):
    return df.stack().groupby(level=0)\
                     .nth(offset).reindex(df.index)

Sample run

>>> first_valid(df, 0)
Date
1     25.0
2     29.0
3     33.0
4     31.0
5     30.0
6     35.0
7     31.0
8     33.0
9     26.0
10    27.0
11    35.0
12    33.0
13    28.0
14    25.0
15    25.0
16    26.0
17    34.0
18    28.0
19    34.0
20    28.0
dtype: float64


>>> first_valid(df, 1)
Date
1      NaN
2      NaN
3      NaN
4     35.0
5     34.0
6     34.0
7     26.0
8     25.0
9     31.0
10    26.0
11    25.0
12    35.0
13    25.0
14    25.0
15    26.0
16    31.0
17    29.0
18    29.0
19    26.0
20    30.0
dtype: float64

>>> first_valid(df, 2)
Date
1      NaN
2      NaN
3      NaN
4      NaN
5      NaN
6      NaN
7      NaN
8     31.0
9      NaN
10    28.0
11    29.0
12    28.0
13    35.0
14    28.0
15    31.0
16    27.0
17    25.0
18    31.0
19     NaN
20     NaN
dtype: float64

Performance

# Sample dataframe for testing purpose
df_test = pd.concat([df] * 10000, ignore_index=True)

%%timeit # Numpy approach
_ = first_valid(df_test.to_numpy(), 1)
# 6.9 ms ± 212 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)


%%timeit # Pandas approach
_ = first_valid(df_test, 1)
# 90 ms ± 867 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)


%%timeit # OP's approach
_ = f(df_test, 1)
# 2.03 s ± 183 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

The numpy-based approach is approximately 300x faster than the OP's given approach, while the pandas-based approach is approximately 22x faster.

Shubham Sharma
  • This is very fast indeed!! It would be very good if I could compile it with Numba as well, which only accepts numpy arrays. Is there a way to remove pandas altogether from this? – RebeccaKennedy Jun 27 '21 at 12:22
  • @jazz090 I'll have to think it over. I'll get back to you if there is a viable solution using numba. – Shubham Sharma Jun 27 '21 at 12:26
  • Many thanks. This has improved things substantially! – RebeccaKennedy Jun 27 '21 at 12:27
  • Ultimately what I am trying to do is calculate the difference between different offset values: `first_valid(df, 1) - first_valid(df, 0)` and `first_valid(df, 1) - first_valid(df, 2)` and so on. I have a loop at present which is iterating the different solutions as `first_valid(df, i+1).values - first_valid(df, i).values`, and that is taking around 1 minute for `i<30`. – RebeccaKennedy Jun 27 '21 at 12:38
  • @jazz090 I've added alternative pure `numpy` based approach. – Shubham Sharma Jun 27 '21 at 13:42
  • Many thanks for your solution. It is very fast!! If you have a look at my output for `f(offset=1)`, particularly the last values, there should be slight differences in the desired output. For offsets >0, the function returns the first element, so for rows 19 and 20 the data selected should come from `FR` and not `DE`, if that makes sense? Imagine drawing a horizontal line from left to right for each row until it intersects with a value. `offset` is then the value for *which* intersection it is, so `offset=0` is the first intersection, `offset=1` is the second intersection and so on for `nth`. – RebeccaKennedy Jun 27 '21 at 14:05
  • Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/234260/discussion-between-shubham-sharma-and-jazz090). – Shubham Sharma Jun 27 '21 at 14:11

This will do it for offset=0, and should be efficient:

df.unstack().groupby('Date').first()

I thought .nth(1) would work for offset=1 and so on, but it doesn't. This is left as an exercise for the reader.
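For instance, on a toy two-row frame (the column names here are just illustrative), the one-liner picks the first non-null value per Date across the columns:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {"GB": [25, np.nan], "US": [np.nan, 34]},
    index=pd.Index([1, 2], name="Date"),
)
# unstack() yields a Series keyed by (column, Date);
# grouping on the Date level and taking first() skips the NaNs
out = df.unstack().groupby("Date").first()
print(out)  # Date 1 -> 25.0 (from GB), Date 2 -> 34.0 (from US)
```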

John Zwinck