1

So I have a DataFrame with 180000+ values and I need to (1) replace duplicate and certain values in cells by row and (2) rearrange. Here is my DataFrame, df:

    key   sellyr  brand  makrc  item1  item2  item3  item4  item5  item6
0   da12  2013    imp    apt    furi   apt    nan    nan    nan    nan
1   da32  2013    sa     rye    rye    app    nan    nan    nan    nan 
2   da14  2013    sa     pro    not    pro    pan    fan    nan    nan
........

nan values represent np.nan. And forbidden string is 'not'.

So what I need to do is check columns item1~6 replace strings that are contained in the makrc column with nan. As well, I also want to replace 'not's' with nan's. After replacing strings to np.nan, I need to rearrange the item1~6 to left justify non-nan data to the leftmost empty cell, as shown below, (expected output):

    key   sellyr  brand  makrc  item1  item2  item3  item4  item5  item6
0   da12  2013    imp    apt    furi   nan    nan    nan    nan    nan
1   da32  2013    sa     rye    app    nan    nan    nan    nan    nan 
2   da14  2013    sa     pro    pan    fan    nan    nan    nan    nan
........

So as you can see in a first index, I have removed apt string in item2 and changed to np.nan because same string is in makrc column. In index 1, I have removed rye and replace with np.nan. But this time, I rearranged the 'app' string from item2 to item1 because np.nan values should come after the values. In index 2, I have replaced pro and not since I need to replace every 'not'string in the item columns to np.nan. Also I have rearranged the items.

I've tried combining all item columns as a list and replacing it, but there are a few rows with only np.nan items. Can you guys recommend an ideal process to solve my problem? Thank you so much.

D.I.
  • 31
  • 6
EJ Kang
  • 455
  • 2
  • 5
  • 17

1 Answers1

3

First, extract a slice of columns beginning with item -

m = df.columns.str.contains('item')
i = df.iloc[:, m]

Mask all values which meet your criteria. Use isin -

j = i[~i.isin(df.makrc.tolist() + ['not'])]

Now. sort values based on NaNs and assign back -

df.loc[:, m] = j.apply(sorted, key=pd.isnull, axis=1)
df

    key  sellyr brand makrc item1 item2  item3  item4  item5  item6
0  da12    2013   imp   apt  furi   NaN    NaN    NaN    NaN    NaN
1  da32    2013    sa   rye   app   NaN    NaN    NaN    NaN    NaN
2  da14    2013    sa   pro   pan   fan    NaN    NaN    NaN    NaN

Details

i

  item1 item2 item3 item4  item5  item6
0  furi   apt   NaN   NaN    NaN    NaN
1   rye   app   NaN   NaN    NaN    NaN
2   not   pro   pan   fan    NaN    NaN
j

  item1 item2 item3 item4  item5  item6
0  furi   NaN   NaN   NaN    NaN    NaN
1   NaN   app   NaN   NaN    NaN    NaN
2   NaN   NaN   pan   fan    NaN    NaN

Toward Better Performance

You could make use of a modified version of Divakar's justified function that works on object arrays -

def justify(a, invalid_val=0, axis=1, side='left'):    
    """
    Justifies a 2D array

    Parameters
    ----------
    A : ndarray
        Input array to be justified
    axis : int
        Axis along which justification is to be made
    side : str
        Direction of justification. It could be 'left', 'right', 'up', 'down'
        It should be 'left' or 'right' for axis=1 and 'up' or 'down' for axis=0.

    """

    if invalid_val is np.nan:
        mask = pd.notnull(a)
    else:
        mask = a!=invalid_val
    justified_mask = np.sort(mask,axis=axis)
    if (side=='up') | (side=='left'):
        justified_mask = np.flip(justified_mask,axis=axis)
    out = np.full(a.shape, invalid_val, dtype=object) 
    if axis==1:
        out[justified_mask] = a[mask]
    else:
        out.T[justified_mask.T] = a.T[mask.T]
    return out
df.loc[:, m] = justify(j.values, invalid_val=np.nan, axis=1, side='left')
df

    key  sellyr brand makrc item1 item2  item3  item4  item5  item6
0  da12    2013   imp   apt  furi   NaN    NaN    NaN    NaN    NaN
1  da32    2013    sa   rye   app   NaN    NaN    NaN    NaN    NaN
2  da14    2013    sa   pro   pan   fan    NaN    NaN    NaN    NaN

This should (hopefully) be faster than calling apply. You'll especially see speed gains using the original version of the function that is optimised for numeric data.

cs95
  • 379,657
  • 97
  • 704
  • 746