4

I have a DataFrame like this.

>>> df = pd.DataFrame([[3., 0, 0], [0, 3., 0], [0, 0, 0], [0, 6., 6.], [1., 0, 0], [2., 5., 0]]).T
>>> df
     0    1    2    3    4    5
0  3.0  0.0  0.0  0.0  1.0  2.0
1  0.0  3.0  0.0  6.0  0.0  5.0
2  0.0  0.0  0.0  6.0  0.0  0.0

What I want to do is keep the first non-zero element of each column and replace every other non-zero value in that column with zero.

>>> expected
     0    1    2    3    4    5
0  3.0  0.0  0.0  0.0  1.0  2.0
1  0.0  3.0  0.0  6.0  0.0  0.0
2  0.0  0.0  0.0  0.0  0.0  0.0

My goal is to get a Series of the first non-zero elements. I thought of doing this via sum(), which is why I need zeros for the other elements in each column.

>>> expected.sum()
0    3.0
1    3.0
2    0.0
3    6.0
4    1.0
5    2.0
dtype: float64

Thank you very much in advance.

crissal

3 Answers

4

Mask the zeros, then bfill and select the first row using iloc:

df[df != 0].bfill().iloc[0].fillna(0)

0    3.0
1    3.0
2    0.0
3    6.0
4    1.0
5    2.0
Name: 0, dtype: float64
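
To make the one-liner above easier to follow, here is the same chain split into steps. This is only an illustrative sketch: the intermediate names `masked`, `filled` and `first_nonzero` are mine, not part of the answer.

```python
import pandas as pd

df = pd.DataFrame(
    [[3., 0, 0], [0, 3., 0], [0, 0, 0], [0, 6., 6.], [1., 0, 0], [2., 5., 0]]
).T

# Step 1: boolean indexing turns every zero into NaN, leaving only non-zero values.
masked = df[df != 0]

# Step 2: back-fill each column so the first non-zero value is pulled up to row 0.
filled = masked.bfill()

# Step 3: take row 0; all-zero columns are still NaN there, so restore them to 0.
first_nonzero = filled.iloc[0].fillna(0)

print(first_nonzero.tolist())  # [3.0, 3.0, 0.0, 6.0, 1.0, 2.0]
```

Note that this goes straight to the Series and never builds the intermediate "expected" frame.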
Shubham Sharma
2

Another way is to first create your target DataFrame using a boolean condition with mask, then sum along the row axis.

df_new = df.mask(~df.ne(0).cumsum(0).cumsum(0).eq(1)).fillna(0)

     0    1    2    3    4    5
0  3.0  0.0  0.0  0.0  1.0  2.0
1  0.0  3.0  0.0  6.0  0.0  0.0
2  0.0  0.0  0.0  0.0  0.0  0.0

then

df_new.sum(0)

0    3.0
1    3.0
2    0.0
3    6.0
4    1.0
5    2.0
5    2.0
dtype: float64
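
The double cumsum is the non-obvious part, so here is a rough breakdown. The names `nonzero`, `seen`, `rank` and `is_first` are just labels I picked for illustration, and I use `where` (keep where True) in place of `mask` on the negated condition, which should be equivalent.

```python
import pandas as pd

df = pd.DataFrame(
    [[3., 0, 0], [0, 3., 0], [0, 0, 0], [0, 6., 6.], [1., 0, 0], [2., 5., 0]]
).T

nonzero = df.ne(0)             # True where the value is non-zero
seen = nonzero.cumsum(axis=0)  # how many non-zero values seen so far, per column
rank = seen.cumsum(axis=0)     # 0 before the first non-zero, 1 on it, >= 2 after it
is_first = rank.eq(1)          # True only on the first non-zero value of each column

print(rank[3].tolist())        # column 3 holds [0, 6, 6] -> [0, 1, 3]

# Keep the flagged values and set everything else to zero.
df_new = df.where(is_first, 0.0)
```

The first cumsum alone is not enough, because it stays at 1 on any zeros that follow the first non-zero value; the second cumsum pushes those rows past 1.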
Umar.H
0

You could do something like:

import pandas as pd

# initialize table
df = pd.DataFrame([[3., 0, 0], [0, 3., 0], [0, 0, 0], [0, 6., 6.], [1., 0, 0], [2., 5., 0]]).T

# detect first non-zero value
# see https://stackoverflow.com/questions/50586146/find-first-non-zero-value-in-each-column-of-pandas-dataframe for details
non_zero_indexes = list(df.ne(0).idxmax()) # [0, 1, 0, 1, 0, 0]

# rebuild every column: keep values up to the first non-zero one, zero out the rest
# (columns whose first non-zero value is in row 0 need this too, e.g. column 5)
for col_id in df.columns:
  col_start = list(df[col_id].iloc[:non_zero_indexes[col_id]+1]) # e.g. [0.0, 6.0]
  col_end   = [0.0] * (len(df) - len(col_start)) # [0.0], i.e. fill with zeros
  df[col_id] = col_start + col_end # merge and get [0.0, 6.0, 0.0]

That way, you get the following output:

>>> df
     0    1    2    3    4    5
0  3.0  0.0  0.0  0.0  1.0  2.0
1  0.0  3.0  0.0  6.0  0.0  0.0
2  0.0  0.0  0.0  0.0  0.0  0.0
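
If only the Series of first non-zero values is needed, the loop can probably be skipped. Below is a minimal sketch of the same "position of the first non-zero value" idea done with NumPy's `argmax`; the names `values`, `first_pos` and `result` are hypothetical, and it relies on `argmax` returning 0 for an all-zero column, where the value at row 0 is already 0.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    [[3., 0, 0], [0, 3., 0], [0, 0, 0], [0, 6., 6.], [1., 0, 0], [2., 5., 0]]
).T

values = df.to_numpy()

# Row position of the first non-zero value in each column (0 if there is none).
first_pos = (values != 0).argmax(axis=0)

# Pick one value per column: row first_pos[j] of column j.
first_vals = values[first_pos, np.arange(values.shape[1])]

result = pd.Series(first_vals, index=df.columns)
print(result.tolist())  # [3.0, 3.0, 0.0, 6.0, 1.0, 2.0]
```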
SpaceBurger
  • For single-value access, [at instead of *loc](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.at.html) is preferred, and faster – Glauco Jul 05 '21 at 08:57
  • This output, however, differs from what I expected. My first non-zero value can be anywhere from the first to the last row, or be missing altogether. – crissal Jul 05 '21 at 09:00
  • Also, I just saw my answer is incorrect. For column 3 for example, I need `(0.0, 6.0, 0.0)` instead of `(0.0, 6.0, 6.0)`. – SpaceBurger Jul 05 '21 at 09:00
  • There is a partial answer to the question [here](https://stackoverflow.com/questions/50586146/find-first-non-zero-value-in-each-column-of-pandas-dataframe), I'll try to use that in my answer – SpaceBurger Jul 05 '21 at 09:02