I am parsing data from a number of PDF documents and storing it in a pandas DataFrame for analysis. When each page is written to the DataFrame, the data points do not all line up under the columns they belong to.

One way I can fix this is to remove cells that contain NaNs and shift the non-null values left.

import pandas as pd
import numpy as np

df = pd.DataFrame({
    'Word':['Text1', np.nan, np.nan, 'Text1', 'Text1'],
    'Word2':['Text2', 'Text1', np.nan, 'Text2', np.nan],
    'Word3':['Text3', 'Text2', 'Text1', 'Text3', np.nan]
})
df

Output of sample df:

    Word    Word2   Word3
0   Text1   Text2   Text3
1   NaN     Text1   Text2
2   NaN     NaN     Text1
3   Text1   Text2   Text3
4   Text1   NaN     NaN

Desired output needed:

    Word    Word2   Word3
0   Text1   Text2   Text3
1   Text1   Text2   
2   Text1   
3   Text1   Text2   Text3
4   Text1   

In this example, only the rows with index 1 and 2 are actually shifted; the trailing NaNs in row 4 are simply shown as blanks.
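The rows that need shifting can also be found programmatically. The following is only a sketch under the assumption that a row "needs shifting" when some NaN has a non-null value somewhere to its right; the variable names are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Word': ['Text1', np.nan, np.nan, 'Text1', 'Text1'],
    'Word2': ['Text2', 'Text1', np.nan, 'Text2', np.nan],
    'Word3': ['Text3', 'Text2', 'Text1', 'Text3', np.nan],
})

isna = df.isna().to_numpy()

# For each cell: True if a non-null value exists at or to the right of it.
# The row is reversed, accumulated with logical OR, then reversed back.
value_to_right = np.logical_or.accumulate(~isna[:, ::-1], axis=1)[:, ::-1]

# A row needs shifting if any NaN cell still has a value to its right.
needs_shift = (isna & value_to_right).any(axis=1)
rows = df.index[needs_shift].tolist()
print(rows)
```

Rows whose NaNs are all trailing (such as row 4) are not flagged, because nothing has to move in them.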

Any assistance would be much appreciated.

Alan

Alan Paul

2 Answers

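One option is to sort each row so that the NaNs move to the end (Python's sorted is stable, so the other values keep their order), then fill the NaNs with empty strings: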

out = (pd.DataFrame(df.apply(sorted, key=pd.isna, axis=1).to_list(),
                    index=df.index, columns=df.columns)
         .fillna('')
       )

Or:

out = (df.apply(lambda x: pd.Series(x.dropna().values), axis=1)
         .fillna('')
         .set_axis(df.columns, axis=1)
       )

Or a vectorized solution with numpy:

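a = df.fillna('').to_numpy()
b = df.isna().to_numpy()

# kind='stable' keeps the non-null values in their original left-to-right order;
# the default sort kind does not guarantee this
out = pd.DataFrame(a[np.arange(len(a))[:,None], np.argsort(b, kind='stable')],
                   index=df.index, columns=df.columns)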

Output:

    Word  Word2  Word3
0  Text1  Text2  Text3
1  Text1  Text2       
2  Text1              
3  Text1  Text2  Text3
4  Text1              
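
The numpy approach could be packaged as a reusable helper. This is a minimal sketch, not a definitive implementation: the function name shift_left, its fill parameter, and the sample frame are hypothetical, and it uses np.take_along_axis in place of the explicit row-index array.

```python
import numpy as np
import pandas as pd


def shift_left(frame, fill=''):
    """Move the non-null values of each row to the left, padding with `fill`."""
    values = frame.fillna(fill).to_numpy()
    is_null = frame.isna().to_numpy()
    # A stable sort on the boolean mask puts non-null positions (False) first
    # while preserving their original left-to-right order.
    order = np.argsort(is_null, axis=1, kind='stable')
    shifted = np.take_along_axis(values, order, axis=1)
    return pd.DataFrame(shifted, index=frame.index, columns=frame.columns)


# Hypothetical sample with a leading NaN (row 0) and an interior NaN (row 1)
sample = pd.DataFrame({
    'Word': [np.nan, 'a'],
    'Word2': ['b', np.nan],
    'Word3': ['c', 'd'],
})
result = shift_left(sample)
print(result)
```

Because the sort is stable, interior NaNs are closed up as well, and the remaining values never change their relative order.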
mozway

Here is another way:

# count the leading NaNs in each row
s = df.isna().cumprod(axis=1).sum(axis=1)

# shift each row left by its number of leading NaNs
df.apply(lambda x: x.shift(-s.loc[x.name]), axis=1)

Output:

    Word  Word2  Word3
0  Text1  Text2  Text3
1  Text1  Text2    NaN
2  Text1    NaN    NaN
3  Text1  Text2  Text3
4  Text1    NaN    NaN
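
To get blanks instead of NaN, as in the desired output, the result can be chained with fillna(''). The sketch below wraps the same idea in a function; the name shift_leading and the one-row interior frame are hypothetical, and astype(int) is added only to make the cumulative product explicit. Since only leading NaNs are counted, a NaN sitting between two values is left where it is.

```python
import numpy as np
import pandas as pd


def shift_leading(frame):
    """Shift each row left by its number of leading NaNs."""
    # cumprod stays 1 only while every cell so far in the row is NaN
    leading = frame.isna().astype(int).cumprod(axis=1).sum(axis=1)
    return frame.apply(lambda row: row.shift(-int(leading.loc[row.name])), axis=1)


df = pd.DataFrame({
    'Word': ['Text1', np.nan, np.nan, 'Text1', 'Text1'],
    'Word2': ['Text2', 'Text1', np.nan, 'Text2', np.nan],
    'Word3': ['Text3', 'Text2', 'Text1', 'Text3', np.nan],
})

# Replace the remaining NaNs with empty strings to match the desired output
blank = shift_leading(df).fillna('')
print(blank)

# Hypothetical row with an interior NaN: nothing is shifted here
interior = pd.DataFrame({'Word': ['a'], 'Word2': [np.nan], 'Word3': ['b']})
unchanged = shift_leading(interior)
print(unchanged)
```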
rhug123