I am parsing data from a number of PDF documents and storing it in a pandas DataFrame for analysis. When each page of a PDF is written to the DataFrame, the data points do not all line up under the columns they belong to.
One way I can fix this is to remove the cells that contain NaN and shift the non-null values left.
import pandas as pd
import numpy as np
df = pd.DataFrame({
    'Word': ['Text1', np.nan, np.nan, 'Text1', 'Text1'],
    'Word2': ['Text2', 'Text1', np.nan, 'Text2', np.nan],
    'Word3': ['Text3', 'Text2', 'Text1', 'Text3', np.nan]
})
df
Output of the sample df:
    Word  Word2  Word3
0  Text1  Text2  Text3
1    NaN  Text1  Text2
2    NaN    NaN  Text1
3  Text1  Text2  Text3
4  Text1    NaN    NaN
Desired output:
    Word  Word2  Word3
0  Text1  Text2  Text3
1  Text1  Text2
2  Text1
3  Text1  Text2  Text3
4  Text1
In this example, only rows with index 1 and 2 actually change.
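For reference, the left shift described above could be sketched roughly as follows. This is only a minimal sketch under the assumption that a row-wise `apply` is fast enough for the data size; `shift_left` and `shifted` are hypothetical names introduced for illustration, and a vectorized approach may be preferable for large frames.

```python
import numpy as np
import pandas as pd

# Same sample data as in the question.
df = pd.DataFrame({
    'Word': ['Text1', np.nan, np.nan, 'Text1', 'Text1'],
    'Word2': ['Text2', 'Text1', np.nan, 'Text2', np.nan],
    'Word3': ['Text3', 'Text2', 'Text1', 'Text3', np.nan],
})

def shift_left(row):
    """Hypothetical helper: move non-null values to the left of a row."""
    # Keep the non-null values in their original left-to-right order.
    values = row.dropna().tolist()
    # Pad on the right with NaN so the row keeps its original width.
    values += [np.nan] * (len(row) - len(values))
    return pd.Series(values, index=row.index)

# Apply the helper to every row; column names are preserved.
shifted = df.apply(shift_left, axis=1)
```

Calling `shifted.fillna('')` afterwards would replace the remaining NaNs with empty strings, which matches the blank cells in the desired output.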
Any assistance would be much appreciated.
Alan