I have the following pandas
df
:
| Date | GB | US | CA | AU | SG | DE | FR |
| ---- | -- | -- | -- | -- | -- | -- | -- |
| 1 | 25 | | | | | | |
| 2 | 29 | | | | | | |
| 3 | 33 | | | | | | |
| 4 | 31 | 35 | | | | | |
| 5 | 30 | 34 | | | | | |
| 6 | | 35 | 34 | | | | |
| 7 | | 31 | 26 | | | | |
| 8 | | 33 | 25 | 31 | | | |
| 9 | | | 26 | 31 | | | |
| 10 | | | 27 | 26 | 28 | | |
| 11 | | | 35 | 25 | 29 | | |
| 12 | | | | 33 | 35 | 28 | |
| 13 | | | | 28 | 25 | 35 | |
| 14 | | | | 25 | 25 | 28 | |
| 15 | | | | 25 | 26 | 31 | 25 |
| 16 | | | | | 26 | 31 | 27 |
| 17 | | | | | 34 | 29 | 25 |
| 18 | | | | | 28 | 29 | 31 |
| 19 | | | | | | 34 | 26 |
| 20 | | | | | | 28 | 30 |
I have partly acomplished what I am trying to do here using Pandas alone but the process takes ages so I am having to use numpy
(see Getting the nearest values to the left in a pandas column) and that is where I am struggling.
Essentialy, I want my function f
which takes an argument int(offset)
, to capture the first non nan
value for each row from the left, and return the whole thing as a numpy
array/vector so that:
f(offset=0)
| 0 | 1 |
| -- | -- |
| 1 | 25 |
| 2 | 29 |
| 3 | 33 |
| 4 | 31 |
| 5 | 30 |
| 6 | 35 |
| 7 | 31 |
| 8 | 33 |
| 9 | 26 |
| 10 | 27 |
| 11 | 35 |
| 12 | 33 |
| 13 | 28 |
| 14 | 25 |
| 15 | 25 |
| 16 | 26 |
| 17 | 34 |
| 18 | 28 |
| 19 | 34 |
| 20 | 28 |
As I have described in the other post, its best to imagine a horizontal line being drawn from the left for each row, and returning the values intersected by that line as an array. offset=0
then returns the first value (in that array) and offset=1
will return the second value intersected and so on.
Therefore:
f(offset=1)
| 0 | 1 |
| -- | --- |
| 1 | nan |
| 2 | nan |
| 3 | nan |
| 4 | 35 |
| 5 | 34 |
| 6 | 34 |
| 7 | 26 |
| 8 | 25 |
| 9 | 31 |
| 10 | 26 |
| 11 | 25 |
| 12 | 35 |
| 13 | 25 |
| 14 | 25 |
| 15 | 26 |
| 16 | 31 |
| 17 | 29 |
| 18 | 29 |
| 19 | 26 |
| 20 | 30 |
The pandas
solution proposed in the post above is very effective:
def f(df, offset=0):
x = df.iloc[:, 0:].apply(lambda x: sorted(x, key=pd.isna)[offset], axis=1)
return x
print(f(df, 1))
However this is very slow with larger iterations. I have tried this with np.apply_along_axis
and its even slower!
Is there a fatser way with numpy
vectorization?
Many thanks.