
I created a data-capturing template. When I import it into Python as a DataFrame, I notice that some of the records span multiple rows.
I need to clean up the spanned records (see the expected representation below).

The 'Entity' column is the anchor column. Currently it does not uniquely identify a record, as one can see from the row(s) underneath it holding NaN. NB: Along the line, I'll be dropping the 'Unnamed:' column.
Essentially, for every row where df.Entity.isnull(), the value(s) must be joined to the row above where df.Entity.notnull().
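
To make that rule concrete, here is a toy sketch of how the spill-over rows can be flagged; the frame and its values are placeholders, not my real data:

## sketch: flagging spill-over rows (placeholder data)
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Entity': ['6. Sur...', np.nan, '7. War...', np.nan],
    'Notes':  ['Formal', 'Assessment', 'Covering Doc...', '3-8 publ...'],
})

spill = df['Entity'].isnull()    ## True on the rows that must be joined upwards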

NB: I can adjust the source; however, I'd like to keep the source template because of its ease of capturing and #reproducibility.

[dummy representation]

Unnamed: 0 Entity Country Naming Type: Mode: Duration Page Reg Elig Unit Structure Notes
6. Sur... UK ...Publ Pros... FT Standard Yes Guid... 2021 General All Intro & discussion... Formal
NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN Assessment
7. War... UK by Publ... Retro... FT 1 yr Yes Reg 38... Staff All Covering Doc...
NaN NaN NaN NaN NaN NaN NaN NaN NaN General NaN 5000 <10000 3-8 publ...
8. EAng... UK Publ... Retro... PT 6 mths Yes Reg... General (Cat B) All Critical Anal... Formal as
NaN NaN NaN NaN NaN NaN NaN NaN NaN Staff (Cat A) NaN 15000 *g...
NaN NaN NaN NaN NaN NaN NaN NaN NaN Edu & LLL l... NaN NaN LLL not...

[expected representation] I expect to have

Unnamed: 0 Entity Country Naming Type: Mode: Duration Page Reg Elig Unit Structure Notes
6. Sur... UK ...Publ Pros... FT Standard Yes Guid... 2021 General All Intro & discussion... Formal
Assessment
7. War... UK by Publ... Retro... FT 1 yr Yes Reg 38... Staff
General
All Covering Doc...
5000 <10000
Formal
3-8 publ...
8. EAng... UK Publ... Retro... PT 6 mths Yes Reg... General (Cat B)
Staff (Cat A)
Edu & LLL l...
All Critical Anal...
15000
Formal as
*g...
LLL not...

My instinct is to test for isnull() on the [Entity] column. I would prefer not to do an if...then/loop check.
My mind wandered to stack, groupby, merge/join and pop, but I'm not sure these approaches are 'right'.

My preference is for 'vectorisation' as much as possible, taking advantage of pandas' DataFrame operations.

I took note of a related solution; in my case, my anchor column [Entity] holds the key value on a single row, while the remaining columns' values either sit on that same row or span multiple rows.
NB: I'm dealing with one DataFrame, not two.

I should also mention that I took note of the SO solution that 'explodes' newlines across multiple rows. That is the opposite of my scenario; however, I take note of it, as it might provide hints.
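
For illustration, here is a minimal sketch of the vectorised direction I have in mind (groupby on a cumulative-sum key, then joining each column's strings with agg); the frame, column names and values are placeholders, not my real data:

## sketch: vectorised clean-up (placeholder data, not the real template)
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Entity': ['6. Sur...', np.nan, '7. War...', np.nan, np.nan],
    'Elig':   ['General', np.nan, 'Staff', 'General', np.nan],
    'Notes':  ['Formal', 'Assessment', 'Covering Doc...', '5000 <10000', '3-8 publ...'],
})

key = df['Entity'].notnull().cumsum()    ## anchor row and its spill-over rows share one key

## per group, join each column's non-null strings; '\n' keeps the multi-line
## feel of the source template (swap for ' ' or '; ' as preferred)
clean = (df.groupby(key, sort=False)
           .agg(lambda s: '\n'.join(s.dropna().astype(str)))
           .reset_index(drop=True))
print(clean)

The cumulative sum gives every anchor row and its NaN continuation rows the same group number, so no explicit loop is needed; agg then does the joining column-wise.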

[UPDATE: Workaround 1]
NB: This workaround is not a solution; it simply provides an alternative!
With leads from a Medium post and an SO post, I attempted, with success, to read my dataset directly from the table in the Word document. For this, I installed the python-docx library.

## code snippet
import pandas as pd
from docx import Document as docs

document = docs("datasets/IDA..._AppendixA.docx")

def read_docx_table(document, tab_id: int = 0, nheader: int = 1, start_row: int = 0):
    table = document.tables[tab_id]    ## select the target table in the document
    data = [[cell.text for cell in row.cells] for i, row in enumerate(table.rows)
            if i >= start_row]
    df = pd.DataFrame(data)
    if nheader == 1:    ## promote the first row to the column header
        df = df.rename(columns=df.iloc[0]).drop(df.index[0]).reset_index(drop=True)
    return df

## parse and show dataframe
df_table = read_docx_table(document, tab_id=3, nheader=1, start_row=0)
df_table

The rows no longer spill over multiple rows. The columns that contain newlines now show the '\n' character. Using df['col'].str.replace(), I can remove the newlines ('\n') or other delimiters, if I so desire.
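
For instance, a minimal sketch of that str.replace() clean-up, on a placeholder Series rather than the real column:

## sketch: stripping embedded newlines (placeholder data)
import pandas as pd

col = pd.Series(['Formal \nAssessment', 'Covering Doc... \n5000 <10000'])

## regex=False treats '\n' as a literal newline character to replace
print(col.str.replace('\n', '; ', regex=False))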

[dataframe representation: importing and parsing using python-docx library] Almost a true representation of the original table in Word

Unnamed: 0 Entity Country Naming Type: Mode: Duration Page Reg Elig Unit Structure Notes
6. Sur... UK ...Publ Pros... FT Standard Yes Guid... 2021 General All Intro & discussion... Formal \n| Assessment
7. War... UK by Publ... Retro... FT 1 yr Yes Reg 38... Staff \nGeneral All Covering Doc... \n| 5000 <10000 Formal \n| 3-8 publ...
8. EAng... UK Publ... Retro... PT 6 mths Yes Reg... General (Cat B)
Staff (Cat A) \nEdu & LLL l...
All Critical Anal... \n| 15000 Formal as \n|*g... \n| LLL not...

[UPDATE 2] After my update (workaround 1), I saw @J_H's comments. Whilst it is not 'data corruption' in the true sense, it is nonetheless an ETL-strategy issue. Thanks @J_H. Absolutely, well-thought-through #design is of the essence.
Going forward, I'll either leave the source template practically as-is, with minor modifications, and use python-docx as I have done; or
I'll modify the source template for easy capture in an Excel or 'csv'-type repository.

Despite the two approaches outlined here (or any other), I'm still keen on data-cleaning code that can clean up the df to give the expected df.

semmyk-research
  • "I noticed that some of the records spanned multiple rows" -- this sounds like data **corruption**. That is, the input text was supposed to conform to some spec, such as CSV, but it maybe failed to escape quote delimeters / newlines correctly or otherwise deviated from the spec. After `pd.read_csv()` completes, pretty much it is too late. I recommend writing a filter that turns non-conformant text into a conformant CSV text file, and _then_ read_csv() that improved input. Telling the filter to deliberately discard unusable input records might be part of your ETL strategy. – J_H Jan 15 '23 at 23:09
  • "... part of your ETL strategy" Thanks, @J_H I saw your comment after I added my [update: walkaround 1]; making use of `python-docx` library. With your valid, valuable comment, I added [update 2] to the question. I agree with you on 'relooking' the 'ETL strategy'. Your line of thought aligns with comments from my other [post](https://www.linkedin.com/feed/update/urn:li:activity:7019220166278266883/?commentUrn=urn%3Ali%3Acomment%3A%28activity%3A7019220166278266883%2C7020529361917255680%29). Let's just say the ***'fix-it'*** in me wants to get the #mess working. I'm still open to working code. – semmyk-research Jan 16 '23 at 10:28

0 Answers