Dataframe: shifting values over columns

Question

I have a dataframe with some NaN values in my s_x columns. If NaN values exist in them, I want them to be in the last columns.

Example: Given values of [NaN, 1, NaN, 2] in the s_x columns, I want them to shift left across the columns, giving [1, 2, NaN, NaN].

My current solution is very slow as I:

iterate over the rows
transform the s_x values into a list
remove NaN values
right-pad the list with NaN values
write it back into the dataframe

How can I improve on the function below? The order of values (low to high) needs to remain the same. Every value is found only once in the s_x columns of a row.

I know that "leaving the pandas-logic" by parsing to a list and back is problematic concerning performance and was thinking of trying to do it with a lambda function, but didn't get anywhere with it.

My current code as a minimal working example:

import pandas as pd
import numpy as np

def shift_values(df, leading_chars):
    """Shifts all values in columns with common leading chars to the left if there are NaN values.
    
    Example:   Given a row of [NaN, 1, NaN, 2]
    the values are shifted to [1, 2, NaN, NaN]
    
    """
    cols = [c for c in df.columns if c.startswith(leading_chars)]

    for index, row in df.iterrows():
        # create list without NaN values
        values = [v for v in row[cols] if not pd.isna(v)] 
        # pad with NaN to get correct number of values again
        values += [np.nan] * (len(cols) - len(values))  

        # overwrite row values with modified list
        for i, c in enumerate(cols): 
            row[c] = values[i]

        # overwrite row in the dataframe (iterrows yields index labels, so use .loc)
        df.loc[index] = row

    return df 

mylist = [["key", "s_1", "s_2", "s_3", "s_4"],
          [1, np.nan, 1, 2, np.nan],
          [1, 10, 20, 25, np.nan],
          [1, 10, np.nan, 25, np.nan]
         ]
df = pd.DataFrame(mylist[1:], columns=mylist[0])

print("______ PREVIOUS ______")
print(df.head())

df = shift_values(df, 's_')
print("______ RESULT ______")
print(df.head())

Andrej Kesely · Accepted Answer · 2021-06-22T09:55:44.660

Try:

df = df.transform(sorted, key=pd.isna, axis=1)
print(df)

Prints:

   key   s_1   s_2   s_3  s_4
0  1.0   1.0   2.0   NaN  NaN
1  1.0  10.0  20.0  25.0  NaN
2  1.0  10.0  25.0   NaN  NaN

EDIT: If columns are not next to each other:

x = df.filter(regex=r"^s_")

df.loc[:, x.columns] = df.loc[:, x.columns].transform(
    sorted, key=pd.isna, axis=1
)
print(df)
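
As a small illustration of the column selection used above, here is a sketch with a hypothetical frame in which the `s_*` columns are interleaved with other columns (the frame `df_demo` and the column `other` are made up for this example). It only shows which labels `df.filter(regex=r"^s_")` picks out; the `str.startswith` variant should give the same labels.

```python
import numpy as np
import pandas as pd

# Hypothetical frame: the s_* columns are not next to each other.
df_demo = pd.DataFrame({
    "s_1": [np.nan, 10.0],
    "key": [1, 1],
    "s_2": [1.0, np.nan],
    "other": ["a", "b"],
    "s_3": [2.0, 25.0],
})

# Select columns whose label matches the regex.
selected = df_demo.filter(regex=r"^s_")
print(list(selected.columns))

# Equivalent selection via the string accessor on the column index.
same = df_demo.columns[df_demo.columns.str.startswith("s_")]
print(list(same))
```

Both selections keep the original column order, so the shifted values can be assigned back with `df.loc[:, selected.columns]`.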

edited Jun 22 '21 at 09:55

answered Jun 22 '21 at 08:34

Andrej Kesely

Correct me if I'm wrong, but this will go over all columns, not only over my `s_x` columns? – Cribber Jun 22 '21 at 08:37
@Cribber You can use something like `df.filter(regex=r'^s_')` if you want to sort only the `s_*` columns, but it's not necessary: Python's `sorted` is a stable sort. – Andrej Kesely Jun 22 '21 at 08:39
Could you expand your solution to show how I would use `df.filter`? – Cribber Jun 22 '21 at 08:47
@Cribber Try `df.loc[:, "s_1":"s_4"] = df.filter(regex=r"^s_").transform(sorted, key=pd.isna, axis=1)` – Andrej Kesely Jun 22 '21 at 08:50
The only downside I see compared to the other solution is that the columns need to be next to each other. – Cribber Jun 22 '21 at 09:38
@Cribber - yep, and it is slower, because apply is a loop under the hood. – jezrael Jun 22 '21 at 09:41
@Cribber I've edited my answer with a solution for when the `s_*` columns aren't next to each other. – Andrej Kesely Jun 22 '21 at 09:56
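
The stability point raised in the comments can be checked in plain Python. This is a minimal sketch, not part of either answer: `math.isnan` stands in for `pd.isna` so the example needs no third-party imports, and the sample rows are made up. The second row assumes a leading non-NaN "key" value; a NaN in a non-`s_*` column, or a key column placed after the `s_*` columns, would be moved as well.

```python
import math

nan = float("nan")

# The key is False for numbers and True for NaN. False sorts before True,
# and a stable sort keeps elements with equal keys in their original order.
row = [nan, 1.0, nan, 2.0]
shifted = sorted(row, key=math.isnan)
print(shifted)

# Hypothetical full row with a leading key value of 7.0: it is not NaN and
# already comes first, so it stays in place.
full_row = [7.0, nan, 1.0, 2.0, nan]
shifted_full = sorted(full_row, key=math.isnan)
print(shifted_full)
```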

jezrael · Answer 2 · 2021-06-22T08:39:45.690

To improve performance, use `justify` on the selected columns only:

#https://stackoverflow.com/a/44559180/2901002
def justify(a, invalid_val=0, axis=1, side='left'):
    """
    Justifies a 2D array

    Parameters
    ----------
    a : ndarray
        Input array to be justified
    invalid_val : scalar
        Value treated as missing (for example np.nan); it also fills
        the vacated positions in the output
    axis : int
        Axis along which justification is to be made
    side : str
        Direction of justification. It could be 'left', 'right', 'up', 'down'
        It should be 'left' or 'right' for axis=1 and 'up' or 'down' for axis=0.

    """

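    # mask is True where a holds a valid (non-missing) value
    if invalid_val is np.nan:
        mask = ~np.isnan(a)
    else:
        mask = a != invalid_val
    # sorting booleans moves the True entries to the end along the axis
    justified_mask = np.sort(mask, axis=axis)
    # flip so the True entries come first for 'up' / 'left'
    if side in ('up', 'left'):
        justified_mask = np.flip(justified_mask, axis=axis)
    out = np.full(a.shape, invalid_val)
    # boolean indexing reads and writes in row-major order, so the
    # original order of the valid values is preserved
    if axis == 1:
        out[justified_mask] = a[mask]
    else:
        out.T[justified_mask.T] = a.T[mask.T]
    return out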

def shift_values(df, leading_chars):
    """Shifts all values in columns with common leading chars to the
       left if there are NaN values.
    
    Example:   Given a row of [NaN, 1, NaN, 2]
    the values are shifted to [1, 2, NaN, NaN]
    
    """
    cols = df.columns[df.columns.str.startswith(leading_chars)]
    df[cols] = justify(df[cols].to_numpy(),  invalid_val=np.nan, axis=1, side='left')
    return df
    

df = shift_values(df, 's_')
print("______ RESULT ______")
print(df.head())

Prints:

______ RESULT ______
   key   s_1   s_2   s_3  s_4
0    1   1.0   2.0   NaN  NaN
1    1  10.0  20.0  25.0  NaN
2    1  10.0  25.0   NaN  NaN
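
For comparison, here is a shorter vectorized sketch of the same left-shift. It is a swapped-in alternative, not the `justify` function from this answer: it uses a stable `np.argsort` on the NaN mask together with `np.take_along_axis`. The helper name `shift_left` is hypothetical, and the sketch assumes the selected columns are numeric and that NaN is the only missing-value marker.

```python
import numpy as np

def shift_left(a):
    """Move the non-NaN values of each row to the left, keeping their order."""
    a = np.asarray(a, dtype=float)
    # False (valid) sorts before True (NaN); kind="stable" keeps equal
    # keys in their original order, so the value order is unchanged.
    order = np.argsort(np.isnan(a), axis=1, kind="stable")
    return np.take_along_axis(a, order, axis=1)

# The s_* values from the question's example frame.
values = np.array([
    [np.nan, 1, 2, np.nan],
    [10, 20, 25, np.nan],
    [10, np.nan, 25, np.nan],
])
result = shift_left(values)
print(result)
```

Under the same assumptions it could be dropped into `shift_values` as `df[cols] = shift_left(df[cols].to_numpy())`.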
