Remove NaN Cells without dropping entire rows (Pandas, Python)

Question

I am parsing data from a number of PDF documents and storing it in a DataFrame for analysis. When each PDF page is written to the pandas DataFrame, the data points do not all line up under the intended columns.

One way I can fix this is to remove cells that contain NaNs and shift the non-null values left.

import pandas as pd
import numpy as np

df = pd.DataFrame({
    'Word':['Text1', np.nan, np.nan, 'Text1', 'Text1'],
    'Word2':['Text2', 'Text1', np.nan, 'Text2', np.nan],
    'Word3':['Text3', 'Text2', 'Text1', 'Text3', np.nan]
})
df

Output of sample df:

    Word    Word2   Word3
0   Text1   Text2   Text3
1   NaN     Text1   Text2
2   NaN     NaN     Text1
3   Text1   Text2   Text3
4   Text1   NaN     NaN

Desired output:

    Word    Word2   Word3
0   Text1   Text2   Text3
1   Text1   Text2   
2   Text1   
3   Text1   Text2   Text3
4   Text1

In this example, only rows with index 1 and 2 actually change.

Any assistance would be much appreciated.

Alan

@Alex One way I can fix this is to remove cells that contain NaNs and shift the non-null values left. — Jurakin, Sep 12 '22 at 12:51
Possible duplicate: https://stackoverflow.com/questions/25941979/remove-nan-cells-without-dropping-the-entire-row-pandas-python3 — Jurakin, Sep 12 '22 at 12:57
If you can modify your input data, I recommend filtering the NaNs out of each list first, e.g. `filter(pd.notna, mylist)` — Jurakin, Sep 12 '22 at 13:00
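
The suggestion in the comment above (cleaning the raw lists before they reach pandas) could look roughly like the sketch below. The `rows` list is a hypothetical stand-in for the per-page data parsed from the PDFs, and the sketch assumes at least one row keeps all three values, so that the column names still match the width of the frame.

```python
import numpy as np
import pandas as pd

# Hypothetical raw rows, as they might come out of the PDF parser.
rows = [
    ['Text1', 'Text2', 'Text3'],
    [np.nan, 'Text1', 'Text2'],
    [np.nan, np.nan, 'Text1'],
]

# Drop the NaNs from each row; the remaining values keep their order.
cleaned = [[v for v in row if pd.notna(v)] for row in rows]

# pandas pads the shorter rows on the right with missing values,
# which are then replaced by empty strings.
out = pd.DataFrame(cleaned, columns=['Word', 'Word2', 'Word3']).fillna('')
print(out)
```

Because the shorter rows are padded on the right, the values end up left-aligned without any reshuffling after the DataFrame is built.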

mozway · Answer 1 · 2022-09-12T13:27:38.177

One option is to sort each row so that the NaNs move to the end, then fill them with empty strings:

out = (pd.DataFrame(df.apply(sorted, key=pd.isna, axis=1).to_list(),
                    index=df.index, columns=df.columns)
         .fillna('')
       )

Or:

out = (df.apply(lambda x: pd.Series(x.dropna().values), axis=1)
         .fillna('')
         .set_axis(df.columns, axis=1)
       )

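Or a vectorized solution with NumPy:

a = df.fillna('').to_numpy()
b = df.isna().to_numpy()

# a stable argsort of the NaN mask puts the non-NaN positions first
# while preserving their original left-to-right order
out = pd.DataFrame(a[np.arange(len(a))[:,None], np.argsort(b, kind='stable')],
                   index=df.index, columns=df.columns)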

output:

    Word  Word2  Word3
0  Text1  Text2  Text3
1  Text1  Text2       
2  Text1              
3  Text1  Text2  Text3
4  Text1
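
To see why the NumPy variant works, it may help to look at the intermediate arrays. The sketch below is a self-contained restatement of the same idea, not a different method: it sorts the boolean NaN mask per row and uses the resulting column order to rearrange the values. The names `mask`, `order` and `shifted` are illustrative, and `np.take_along_axis` is used here in place of the fancy indexing above.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Word':  ['Text1', np.nan, np.nan, 'Text1', 'Text1'],
    'Word2': ['Text2', 'Text1', np.nan, 'Text2', np.nan],
    'Word3': ['Text3', 'Text2', 'Text1', 'Text3', np.nan],
})

# True where a cell is NaN.
mask = df.isna().to_numpy()

# False sorts before True, so each row lists its non-NaN column positions
# first; kind='stable' keeps those positions in their original order.
order = np.argsort(mask, axis=1, kind='stable')

# Rearrange every row of the filled values according to that order.
values = df.fillna('').to_numpy()
shifted = np.take_along_axis(values, order, axis=1)

out = pd.DataFrame(shifted, index=df.index, columns=df.columns)
print(order)
print(out)
```

For row 1 the mask is `[True, False, False]`, so the column order becomes `[1, 2, 0]`: the two text values move left and the former NaN (now an empty string) ends up last.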

score 0 · Answer 2 · answered Sep 12 '22 at 13:56

Here is another way:

# helper series: number of leading NaNs in each row
s = df.isna().cumprod(axis=1).sum(axis=1)

# shift each row to the left by its number of leading NaNs
df.apply(lambda x: x.shift(-s.loc[x.name]), axis=1)

Output:

    Word  Word2  Word3
0  Text1  Text2  Text3
1  Text1  Text2    NaN
2  Text1    NaN    NaN
3  Text1  Text2  Text3
4  Text1    NaN    NaN
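
Since the helper series counts only leading NaNs, this approach leaves a NaN that sits between two values where it is. The sketch below illustrates that on a small hypothetical frame `df2` (not part of the question's data); the shift amount is cast to a plain `int` as a precaution.

```python
import numpy as np
import pandas as pd

# Hypothetical data: row 0 has a leading NaN, row 1 has a NaN in the middle.
df2 = pd.DataFrame({
    'Word':  [np.nan, 'Text1'],
    'Word2': ['Text1', np.nan],
    'Word3': ['Text2', 'Text2'],
})

# Number of leading NaNs per row: 1 for row 0, 0 for row 1.
s = df2.isna().cumprod(axis=1).sum(axis=1)

# Shift each row left by its leading-NaN count.
out = df2.apply(lambda x: x.shift(-int(s.loc[x.name])), axis=1)
print(out)
```

Row 0 is shifted left by one position, while row 1 is returned unchanged with its NaN still in the middle. If the parsed data can contain such gaps, one of the sorting-based approaches in the other answer may be a better fit.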
