I created a data-capturing template. When I import it into Python (as a DataFrame), I notice that some records span multiple rows.
I need to clean up the spanned records (see the expected representation).
The 'Entity' column is the anchor column. Currently, it is not definitive, as one can see from the row(s) underneath it containing NaN.
NB: Along the line, I'll be dropping the 'Unnamed:' column.
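(For reference, a minimal sketch of that drop on a dummy frame; the real column labels may differ:)

```python
import pandas as pd

df = pd.DataFrame({"Unnamed: 0": [6, 7], "Entity": ["Sur...", "War..."]})

# drop every column whose label starts with 'Unnamed'
df = df.drop(columns=[c for c in df.columns if str(c).startswith("Unnamed")])
```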
Essentially, for every row where df.Entity.isnull(), the value(s) must be joined to the row above where df.Entity.notnull().
NB: I can adjust the source; however, I'd like to keep the source template because of ease of capturing and #reproducibility.
[dummy representation]
Unnamed: 0 | Entity | Country | Naming | Type: | Mode: | Duration | Page | Reg | Elig | Unit | Structure | Notes |
---|---|---|---|---|---|---|---|---|---|---|---|---|
6. | Sur... | UK | ...Publ | Pros... | FT | Standard | Yes | Guid... 2021 | General | All | Intro & discussion... | Formal |
NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | Assessment |
7. | War... | UK | by Publ... | Retro... | FT | 1 yr | Yes | Reg 38... | Staff | All | Covering Doc... | |
NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | General | NaN | 5000 <10000 | 3-8 publ... |
8. | EAng... | UK | Publ... | Retro... | PT | 6 mths | Yes | Reg... | General (Cat B) | All | Critical Anal... | Formal as |
NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | Staff (Cat A) | NaN | 15000 | *g... |
NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | Edu & LLL l... | NaN | NaN | LLL not... |
[expected representation] I expect to have
Unnamed | Entity | Country | Naming | Type: | Mode: | Duration | Page | Reg | Elig | Unit | Structure | Notes |
---|---|---|---|---|---|---|---|---|---|---|---|---|
6. | Sur... | UK | ...Publ | Pros... | FT | Standard | Yes | Guid... 2021 | General | All | Intro & discussion... | Formal Assessment |
7. | War... | UK | by Publ... | Retro... | FT | 1 yr | Yes | Reg 38... | Staff General | All | Covering Doc... 5000 <10000 | Formal 3-8 publ... |
8. | EAng... | UK | Publ... | Retro... | PT | 6 mths | Yes | Reg... | General (Cat B) Staff (Cat A) Edu & LLL l... | All | Critical Anal... 15000 | Formal as *g... LLL not... |
My instinct is to test for isnull() on the [Entity] column. I would prefer not to use an if...then/'loop' check.
My mind wandered along stack, groupby, merge/join, and pop; I'm not sure these approaches are 'right'.
My preference is for 'vectorisation' as much as possible, taking advantage of pandas' DataFrame operations.
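To illustrate the kind of vectorised approach I have in mind, here is a sketch on dummy data (column names and values are placeholders): a cumulative count of the non-null Entity rows labels each record and its spill-over rows, and a groupby then joins each record's strings per column.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Entity": ["Sur...", np.nan, "War...", np.nan],
    "Elig":   ["General", np.nan, "Staff", "General"],
    "Notes":  ["Formal", "Assessment", "Formal", "3-8 publ..."],
})

# each non-null Entity starts a new record; cumsum() gives that record's
# spill-over rows the same group label
key = df["Entity"].notna().cumsum()

# within each record, join the non-null strings of every column
clean = (df.groupby(key)
           .agg(lambda s: " ".join(s.dropna().astype(str)))
           .reset_index(drop=True))
```

No explicit loop over rows is needed; the only per-group work is the string join inside agg.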
I took note of
- Merging Two Rows (one with a value, the other NaN) in Pandas
- Pandas dataframe merging rows to remove NaN
- Concatenate column values in Pandas DataFrame with "NaN" values
- Merge DF of different Sizes with NaN values in between
In my case, my anchor column [Entity] has its key values on single rows; however, the associated values may sit on that row or span multiple rows.
NB: I'm dealing with one DataFrame, not two.
I should also mention the SO solution that 'explodes' newlines across multiple rows. That is the opposite of my scenario; however, I take note of it as it might provide hints.
[UPDATE: Workaround 1]
NB: This workaround is not a solution. It simply provides an alternative!
With leads from a Medium post and an SO post, I successfully read my dataset directly from the table in the Word document. For this, I installed the python-docx library.
## code snippet
from docx import Document
import pandas as pd

document = Document("datasets/IDA..._AppendixA.docx")

def read_docx_table(document, tab_id: int = 0, nheader: int = 1, start_row: int = 0):
    ## grab the tab_id-th table and read every cell's text from start_row onward
    table = document.tables[tab_id]
    data = [[cell.text for cell in row.cells] for i, row in enumerate(table.rows)
            if i >= start_row]
    df = pd.DataFrame(data)
    if nheader == 1:  ## first row as column header
        df = df.rename(columns=df.iloc[0]).drop(df.index[0]).reset_index(drop=True)
    return df

## parse and show dataframe
df_table = read_docx_table(document, tab_id=3, nheader=1, start_row=0)
df_table
The rows are no longer spilling over multiple rows. The columns with newlines now show the '\n' character.
If I so desire, I can remove the '\n' newlines or other delimiters with df['col'].str.replace().
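For example, on a dummy column (assuming literal '\n' characters inside the cells):

```python
import pandas as pd

df = pd.DataFrame({"Notes": ["Formal \nAssessment", "Formal \n3-8 publ..."]})

# replace newlines with spaces, then collapse runs of whitespace
df["Notes"] = (df["Notes"]
               .str.replace("\n", " ", regex=False)
               .str.replace(r"\s+", " ", regex=True)
               .str.strip())
```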
[dataframe representation: importing and parsing using python-docx library] Almost a true representation of the original table in Word
Unnamed | Entity | Country | Naming | Type: | Mode: | Duration | Page | Reg | Elig | Unit | Structure | Notes |
---|---|---|---|---|---|---|---|---|---|---|---|---|
6. | Sur... | UK | ...Publ | Pros... | FT | Standard | Yes | Guid... 2021 | General | All | Intro & discussion... | Formal \nAssessment |
7. | War... | UK | by Publ... | Retro... | FT | 1 yr | Yes | Reg 38... | Staff \nGeneral | All | Covering Doc... \n5000 <10000 | Formal \n3-8 publ... |
8. | EAng... | UK | Publ... | Retro... | PT | 6 mths | Yes | Reg... | General (Cat B) Staff (Cat A) \nEdu & LLL l... | All | Critical Anal... \n15000 | Formal as \n*g... \nLLL not... |
[UPDATE 2]
After my Workaround 1 update, I saw @J_H's comments. Whilst it is not 'data corruption' in the true sense, it is nonetheless an ETL strategy issue. Thanks @J_H. Absolutely, well-thought-through #design is of the essence.
Going forward, I'll either leave the source template practically as-is with minor modifications and use python-docx as I've done, or modify the source template for easy capture in an Excel or 'csv'-type repository.
Whichever of the two approaches outlined here (or any other) I take, I'm still keen on data-cleaning code that can clean up the df to give the expected df.