4

I have gone through several posts and they either only apply to examples with one column, or with only NaN or 0 values - but not both.

My df looks like this. I would like to fill-in column 'Main' with the non-missing or non-zero string found in the four columns right to it.

current df =

import pandas as pd

d = {'Main': ['','','',''], 'col2': ['Big','','',0], 'col3': [0,'Medium',0,''], 'col4': ['','','Small',''], 'col5':['',0,'','Vsmall']}
df = pd.DataFrame(data=d)

+------+------+--------+-------+--------+
| Main | Col2 | Col3   | Col4  | Col5   |
+------+------+--------+-------+--------+
|      | Big  | 0      | ...   |        |
+------+------+--------+-------+--------+
|      | ...  | Medium | ...   | 0      |
+------+------+--------+-------+--------+
|      |      | 0      | Small |        |
+------+------+--------+-------+--------+
|      | 0    | ...    | ...   | Vsmall |
+------+------+--------+-------+--------+

desired output df

+--------+------+--------+-------+--------+
| Main   | Col2 | Col3   | Col4  | Col5   |
+--------+------+--------+-------+--------+
| Big    | Big  | 0      | ...   |        |
+--------+------+--------+-------+--------+
| Medium | ...  | Medium | ...   | 0      |
+--------+------+--------+-------+--------+
| Small  |      | 0      | Small |        |
+--------+------+--------+-------+--------+
| Vsmall | 0    | ...    | ...   | Vsmall |
+--------+------+--------+-------+--------+

Thanks in advance!

Utsav Chokshi
  • 1,357
  • 1
  • 13
  • 35
StatsScared
  • 517
  • 6
  • 20
  • Next time, Add sample code for data loading and what you have tried so far to solve problem. – Utsav Chokshi Mar 24 '20 at 05:09
  • What if there two(or more) columns to the right having non-zero and not null values? Do you select the first such occurring value or any value? The answer might change accordingly. – nishant Mar 24 '20 at 05:22
  • @nishant.. Agree. In such a case, @jezrael answer with `DataFrame.where` works. – Utsav Chokshi Mar 24 '20 at 05:43

2 Answers2

4

Idea is replace 0 and empty strings to missing values by DataFrame.mask, then back filling missing rows and last select first column:

c = ['col2','col3','col4','col5']
df['Main'] = df[c].mask(df.isin(['0','',0])).bfill(axis=1).iloc[:, 0]
print (df)
     Main col1    col2   col3
0     Big  Big    None       
1  Medium    0  Medium   None
2   Small            0  Small

If possible create list of all possible extracted strings replace all another values by DataFrame.where:

['col2','col3','col4','col5']
df['Main'] = df[c].where(df.isin(['Big','Medium','Small','Vsmall'])).bfill(axis=1).iloc[:,0]
print (df)
     Main col1    col2   col3
0     Big  Big    None       
1  Medium    0  Medium   None
2   Small            0  Small

Details:

print (df[c].mask(df.isin(['0','',0])))
#print (df[c].where(df.isin(['Big','Medium','Small','Vsmall'])))

   col1    col2   col3
0  Big    None    NaN
1  NaN  Medium   None
2  NaN     NaN  Small

print (df[c].mask(df.isin(['0','',0])).bfill(axis=1))
     col1    col2   col3
0     Big     NaN    NaN
1  Medium  Medium   None
2   Small   Small  Small
jezrael
  • 822,522
  • 95
  • 1,334
  • 1,252
0

From sample data presented by you, I think what you are trying to achieve is decoding one-hot encoded data (a classic technique for converting categorical data to numerical data in Machine Learning).

Here is code to achieve decoding:

import pandas as pd

d = {'Main': [0,0,0,0], 'col2': ['Big','','',0], 'col3': [0,'Medium',0,''], 'col4': ['','','Small',''], 'col5':['',0,'','Vsmall']}
df = pd.DataFrame(data=d)

def reduce_function(row):
    for col in ['col2','col3','col4','col5']:
        if not pd.isnull(row[col]) and row[col] != 0 and row[col] != '':
            return row[col]

df['Main']=df.apply(reduce_function, axis=1)

Note : Always consider, using reductions (i.e. apply()) on dataframes than iterating over rows.

Utsav Chokshi
  • 1,357
  • 1
  • 13
  • 35
  • Refer to answers over here : https://stackoverflow.com/questions/38334296/reversing-one-hot-encoding-in-pandas – Utsav Chokshi Mar 24 '20 at 05:08
  • 1
    Ya, agree `Always consider, using reductions (i.e. apply()) on dataframes than iterating over rows.` - but `apply` are loops under the hood, so also should be avoid. – jezrael Mar 24 '20 at 05:12