I created a data-capturing template. When I import it into Python (as a DataFrame), I notice that some records span multiple rows.
I need to clean up the spanned records (see the expected representation).
The 'Entity' column is the anchor column. Currently, it is not definitive, as one can see from the row(s) underneath it containing NaN.
NB: Along the line, I'll be dropping the 'Unnamed:' column.
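(For reference, a minimal sketch of that drop on a dummy frame; the real column labels may differ:)

```python
import pandas as pd

df = pd.DataFrame({"Unnamed: 0": [6, 7], "Entity": ["Sur...", "War..."]})

# drop every column whose label starts with 'Unnamed'
df = df.drop(columns=[c for c in df.columns if str(c).startswith("Unnamed")])
```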
Essentially, for every row where df.Entity.isnull(), the value(s) must be joined to the row above where df.Entity.notnull().
NB: I can adjust the source; however, I'd like to keep the source template because of ease of capturing and #reproducibility.
[dummy representation]
Unnamed: 0 | Entity | Country | Naming | Type: | Mode: | Duration | Page | Reg | Elig | Unit | Structure | Notes |
---|---|---|---|---|---|---|---|---|---|---|---|---|
6. | Sur... | UK | ...Publ | Pros... | FT | Standard | Yes | Guid... 2021 | General | All | Intro & discussion... | Formal |
NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | Assessment |
7. | War... | UK | by Publ... | Retro... | FT | 1 yr | Yes | Reg 38... | Staff | All | Covering Doc... | |
NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | General | NaN | 5000 <10000 | 3-8 publ... |
8. | EAng... | UK | Publ... | Retro... | PT | 6 mths | Yes | Reg... | General (Cat B) | All | Critical Anal... | Formal as |
NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | Staff (Cat A) | NaN | 15000 | *g... |
NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | Edu & LLL l... | NaN | NaN | LLL not... |
[expected representation] I expect to have
Unnamed | Entity | Country | Naming | Type: | Mode: | Duration | Page | Reg | Elig | Unit | Structure | Notes |
---|---|---|---|---|---|---|---|---|---|---|---|---|
6. | Sur... | UK | ...Publ | Pros... | FT | Standard | Yes | Guid... 2021 | General | All | Intro & discussion... | Formal Assessment |
7. | War... | UK | by Publ... | Retro... | FT | 1 yr | Yes | Reg 38... | Staff General | All | Covering Doc... 5000 <10000 | Formal 3-8 publ... |
8. | EAng... | UK | Publ... | Retro... | PT | 6 mths | Yes | Reg... | General (Cat B) Staff (Cat A) Edu & LLL l... | All | Critical Anal... 15000 | Formal as *g... LLL not... |
My instinct is to test for isnull() on the [Entity] column. I would prefer not to use an if...then/'loop' check.
My mind wandered along stack, groupby, merge/join, and pop; I'm not sure these approaches are 'right'.
My preference is for 'vectorisation' as much as possible, taking advantage of pandas' DataFrame operations.
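To illustrate the kind of vectorised approach I have in mind, here is a sketch on dummy data (column names and values are placeholders): a cumulative count of the non-null Entity rows labels each record and its spill-over rows, and a groupby then joins each record's strings per column.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Entity": ["Sur...", np.nan, "War...", np.nan],
    "Elig":   ["General", np.nan, "Staff", "General"],
    "Notes":  ["Formal", "Assessment", "Formal", "3-8 publ..."],
})

# each non-null Entity starts a new record; cumsum() gives that record's
# spill-over rows the same group label
key = df["Entity"].notna().cumsum()

# within each record, join the non-null strings of every column
clean = (df.groupby(key)
           .agg(lambda s: " ".join(s.dropna().astype(str)))
           .reset_index(drop=True))
```

No explicit loop over rows is needed; the only per-group work is the string join inside agg.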
I took note of
- Merging Two Rows (one with a value, the other NaN) in Pandas
- Pandas dataframe merging rows to remove NaN
- Concatenate column values in Pandas DataFrame with "NaN" values
- Merge DF of different Sizes with NaN values in between
In my case, my anchor column [Entity] has its key values on single rows; however, the associated values may sit on that row or span multiple rows.
NB: I'm dealing with one DataFrame, not two.
I should also mention the SO solution that 'explodes' newlines across multiple rows. That is the opposite of my scenario; however, I take note of it as it might provide hints.
[UPDATE: Workaround 1]
NB: This workaround is not a solution. It simply provides an alternative!
With leads from a Medium post and an SO post, I successfully read my dataset directly from the table in the Word document. For this, I installed the python-docx library.
## code snippet
from docx import Document
import pandas as pd

document = Document("datasets/IDA..._AppendixA.docx")

def read_docx_table(document, tab_id: int = 0, nheader: int = 1, start_row: int = 0):
    ## grab the tab_id-th table and read every cell's text from start_row onward
    table = document.tables[tab_id]
    data = [[cell.text for cell in row.cells] for i, row in enumerate(table.rows)
            if i >= start_row]
    df = pd.DataFrame(data)
    if nheader == 1:  ## first row as column header
        df = df.rename(columns=df.iloc[0]).drop(df.index[0]).reset_index(drop=True)
    return df

## parse and show dataframe
df_table = read_docx_table(document, tab_id=3, nheader=1, start_row=0)
df_table
The rows are no longer spilling over multiple rows. The columns with newlines now show the '\n' character.
If I so desire, I can remove the '\n' newlines or other delimiters with df['col'].str.replace().
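For example, on a dummy column (assuming literal '\n' characters inside the cells):

```python
import pandas as pd

df = pd.DataFrame({"Notes": ["Formal \nAssessment", "Formal \n3-8 publ..."]})

# replace newlines with spaces, then collapse runs of whitespace
df["Notes"] = (df["Notes"]
               .str.replace("\n", " ", regex=False)
               .str.replace(r"\s+", " ", regex=True)
               .str.strip())
```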
[dataframe representation: importing and parsing using python-docx library] Almost a true representation of the original table in Word
Unnamed | Entity | Country | Naming | Type: | Mode: | Duration | Page | Reg | Elig | Unit | Structure | Notes |
---|---|---|---|---|---|---|---|---|---|---|---|---|
6. | Sur... | UK | ...Publ | Pros... | FT | Standard | Yes | Guid... 2021 | General | All | Intro & discussion... | Formal \nAssessment |
7. | War... | UK | by Publ... | Retro... | FT | 1 yr | Yes | Reg 38... | Staff \nGeneral | All | Covering Doc... \n5000 <10000 | Formal \n3-8 publ... |
8. | EAng... | UK | Publ... | Retro... | PT | 6 mths | Yes | Reg... | General (Cat B) Staff (Cat A) \nEdu & LLL l... | All | Critical Anal... \n15000 | Formal as \n*g... \nLLL not... |
[UPDATE 2]
After my Workaround 1 update, I saw @J_H's comments. Whilst it is not 'data corruption' in the true sense, it is nonetheless an ETL strategy issue. Thanks @J_H. Absolutely, well-thought-through #design is of the essence.
Going forward, I'll either leave the source template practically as-is with minor modifications and use python-docx as I've done, or modify the source template for easy capture in an Excel or 'csv'-type repository.
Whichever of the two approaches outlined here (or any other) I take, I'm still keen on data-cleaning code that can clean up the df to give the expected df.