I have a poorly structured dataframe that was generated by reading tables in directly from a pdf.
I am trying to manipulate some of the data before putting it into visualization tools.
A key transformation I am trying to make is to extract a column header and use it as a row label. Here is an example of the kind of dataframe I am working with:
data = {'Col1': ['Alabama', 'nan', 'nan', 'nan', 'Wyoming', 'nan', 'nan', 'nan'],
'Col2': ['nan', 1, 2, 3, 'nan', 1, 2, 3]}
df = pd.DataFrame(data)
The resulting dataframe looks a bit like this:
Col1 Col2
0 AL nan
1 nan 1
2 nan 2
3 nan 3
4 WY nan
5 nan 1
6 nan 2
7 nan 3
Whereby the entries in Col 1 are mostly nan except for those on row 0 (AL) and row 4 (WY). These were effectively subheaders in the table in the pdf.
I am trying to write a code that takes the last valid value in Col1 (e.g., AL) and then fills the remaining rows below it until it encounters the next valid value (e.g., WY).
Correct output would look something like this:
Col1 Col2
0 AL nan
1 AL 1
2 AL 2
3 AL 3
4 WY nan
5 WY 1
6 WY 2
7 WY 3
I am somewhat at a loss for how to proceed here and welcome any advise on where to start out.