I'm trying to melt 3 columns into one categorical column using pd.melt. Currently, the dataframe looks something like this.

   id1  Plane   Car  Boat
0  123   None  None  None
1  124  Plane  None  None
2  125   None  None  Boat
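
For reference, a minimal snippet to rebuild that frame (assuming the empty cells hold the literal string 'None', as printed) is:

import pandas as pd

# Small sample frame matching the table above; 'None' is assumed here to
# be a string rather than a real None object.
df = pd.DataFrame({
    'id1':   [123, 124, 125],
    'Plane': ['None', 'Plane', 'None'],
    'Car':   ['None', 'None', 'None'],
    'Boat':  ['None', 'None', 'Boat'],
})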

At some point, I replace None with NaN, but I'm not sure whether that is necessary before melting. My goal is to have a single category column that lists the type of vehicle, using None only if all columns are empty.

   id1   Type
0  123   None
1  124  Plane
2  125   Boat

The code I came up with was this:

df = pd.melt(df, id_vars=['id1'], var_name='Type')

The issue I have is that it triples the observations in my dataframe. I could filter out rows where Type = None, but that drops data such as id1 = 123 where all three of the original columns were None.

   id1   Type
0  123   None
1  123   None
2  123   None
3  124  Plane
4  124   None
5  124   None

Is there an efficient way to do this with melt? Or do I need to loop through the data and write to a new dataframe with conditionals?
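
One rough melt-based sketch would be to drop the None rows after melting and then left-merge the result back onto the unique ids, so an all-None id such as 123 survives with NaN:

import numpy as np
import pandas as pd

# Rough, untested sketch using df from the sample above: melt, drop the
# empty rows, then merge back on id1 so ids whose columns were all None
# are kept.
melted = pd.melt(df.replace('None', np.nan), id_vars=['id1'], var_name='Type')
melted = melted.dropna(subset=['value'])
df_out = df[['id1']].merge(melted[['id1', 'Type']], on='id1', how='left')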

TylerLOL
4 Answers

You can do it this way, using reindex to get back those missing id values.

df1 = df.replace('None',np.nan).set_index('id1')
df1.stack().reset_index(level=1, drop=True).reindex(df1.index)

Output:

id1
123      NaN
124    Plane
125     Boat
dtype: object
Scott Boston
  • My actual dataframe doesn't have an index. I use a column of numeric, non-sequential, unique IDs. I think this creates problems. – TylerLOL Jul 22 '18 at 16:13
  • Actually, your dataframe doesn't need an index for this method to work. I set the index using the 'ID' column, and the sequence doesn't matter. If you try this particular method, what problems/errors are you getting? – Scott Boston Jul 22 '18 at 18:13
  • ValueError: cannot reindex from a duplicate axis – TylerLOL Jul 22 '18 at 19:35
  • Okay, what this means is that you have more than one of some id in your data set, or an id has more than one category assigned. In those cases do you want two or more records? – Scott Boston Jul 22 '18 at 19:53
  • Running value_counts on Id shows no duplicates. It is possible that one observation has more than one category; it's a large dataset, roughly 2000 observations. If there's more than one category, I would want both records. – TylerLOL Jul 22 '18 at 20:51
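
A sketch that avoids the reindex entirely (so an id with more than one category simply yields one row per category, and all-None ids come back through a left merge) could look like this:

import numpy as np
import pandas as pd

# Sketch only, using df from the question: stack keeps every non-null
# value (duplicate ids survive), and the left merge restores ids whose
# columns were all None.
stacked = (df.replace('None', np.nan)
             .set_index('id1')
             .stack()
             .rename('Type')
             .reset_index(level=1, drop=True)
             .reset_index())
out = df[['id1']].merge(stacked, on='id1', how='left')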

You can back fill missing values and then select the first column by position with iloc:

df = df.replace('None', np.nan)

df = df.set_index('id1').bfill(axis=1).iloc[:, 0].rename('Type').reset_index()
print (df)
   id1   Type
0  123    NaN
1  124  Plane
2  125   Boat

If performance is important, it is possible to use the justify function in numpy with 2 changes (marked in the comments below):

def justify(a, invalid_val=0, axis=1, side='left'):
    """
    Justifies a 2D array

    Parameters
    ----------
    a : ndarray
        Input array to be justified
    invalid_val : scalar
        Value to be treated as missing
    axis : int
        Axis along which justification is to be made
    side : str
        Direction of justification. It could be 'left', 'right', 'up', 'down'
        It should be 'left' or 'right' for axis=1 and 'up' or 'down' for axis=0.

    """

    if invalid_val is np.nan:
        mask = pd.notnull(a)  # change 1: use pd.notnull instead of a != invalid_val
    else:
        mask = a != invalid_val
    justified_mask = np.sort(mask, axis=axis)
    if (side == 'up') | (side == 'left'):
        justified_mask = np.flip(justified_mask, axis=axis)
    out = np.full(a.shape, invalid_val, dtype=object)  # change 2: dtype=object to hold strings
    if axis == 1:
        out[justified_mask] = a[mask]
    else:
        out.T[justified_mask.T] = a.T[mask.T]
    return out

Same idea in numpy - a new DataFrame is created by assigning the 1d array:

arr = df.replace('None', np.nan).values[:, 1:]
out = justify(arr, invalid_val=np.nan)[:, 0]
print (out)
[nan 'Plane' 'Boat']

df = df[['id1']].assign(Type=out)
print (df)
   id1   Type
0  123    NaN
1  124  Plane
2  125   Boat
jezrael

No need to use pd.melt, just use this:

import numpy as np

df = df.replace('None', np.nan)
df['final'] = df.apply(lambda a: pd.Series(a[1:]).dropna().iloc[0] if len(pd.Series(a[1:]).dropna()) != 0 else np.nan, axis=1)
print(df[['id1','final']])

Output:

   id1  final
0  123    NaN
1  124  Plane
2  125   Boat
U13-Forward

You can achieve this result by taking the relevant rows from the transposed dataframe, converting "None" to an empty string and summing (concatenating) the values, as demonstrated below.

Input:

import pandas as pd
from io import StringIO

df = pd.read_table(StringIO("""    id1   Plane  Car   Boat
0   123   None   None  None
1   124   Plane  None  None
2   125   None   None  Boat"""), delimiter=r"\s+")
df
Out[229]: 
   id1  Plane   Car  Boat
0  123   None  None  None
1  124  Plane  None  None
2  125   None  None  Boat

Code:

df["Type"] = df.T.iloc[1:].replace({"None":""}).sum().replace({"":"None"})
df.drop(columns=['Plane', 'Car', 'Boat'], inplace=True)

Output:

df
Out[231]: 
   id1   Type
0  123   None
1  124  Plane
2  125   Boat
cosmic_inquiry