Preface
It has been a while since I last used Python, so I'm having trouble with data cleaning. Notepad++ is really slow for this, so I am looking for a more efficient option in Python.
What I need
I need to clean over 100 files in one directory. All of them were extracted manually from SAP.
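For the "100 files in one directory" part, a loop along these lines might be enough. This is only a sketch: the function name `clean_directory`, the `transform` callback, the `*.txt` pattern, and the UTF-8 encoding are all assumptions, since the question does not say how the exports are named or encoded.

```python
from pathlib import Path
from typing import Callable


def clean_directory(src_dir: str, dst_dir: str,
                    transform: Callable[[str], str],
                    pattern: str = "*.txt") -> int:
    """Apply `transform` to every file in src_dir that matches `pattern`.

    Cleaned copies are written to dst_dir under the same file name, so the
    original exports stay untouched. Returns the number of files processed.
    The "*.txt" default is an assumption about the export file names.
    """
    src, dst = Path(src_dir), Path(dst_dir)
    dst.mkdir(parents=True, exist_ok=True)
    count = 0
    for path in sorted(src.glob(pattern)):
        if not path.is_file():
            continue
        text = path.read_text(encoding="utf-8")  # encoding is an assumption
        (dst / path.name).write_text(transform(text), encoding="utf-8")
        count += 1
    return count
```

Writing to a separate output folder instead of overwriting in place keeps the raw SAP exports available if the cleaning rules need to change later.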
Steps that I am looking for:
- Remove the first line (-----)
- Remove the third line (-----)
- Remove the first and last character (|) from each line
- Remove whitespace where needed; whitespace between words of text must be kept
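The steps above may not need a regex at all. Below is a minimal sketch using only `str.split` and `str.strip`; the function names are made up for illustration. It assumes that every line consisting solely of dashes is a separator to drop (rather than dropping lines by position) and that blank lines can be discarded.

```python
def is_separator(line: str) -> bool:
    """True for lines made up only of dashes, such as the table borders."""
    stripped = line.strip()
    return bool(stripped) and set(stripped) == {"-"}


def clean_line(line: str) -> str:
    """Drop one leading and one trailing pipe, then trim every field."""
    line = line.strip()
    if line.startswith("|"):
        line = line[1:]
    if line.endswith("|"):
        line = line[:-1]
    # Splitting on the pipe keeps whitespace inside a field intact;
    # only the padding around each field is stripped.
    return "|".join(field.strip() for field in line.split("|"))


def clean_text(text: str) -> str:
    """Clean a whole export: skip separators and blanks, clean the rest."""
    cleaned = [
        clean_line(line)
        for line in text.splitlines()
        if line.strip() and not is_separator(line)
    ]
    return "\n".join(cleaned) + "\n"
```

Because each field is stripped on its own, a value such as "TEXT I WANT TO KEEP" keeps its inner spaces while the padding next to the pipes disappears.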
Original File
---------------------------------------------------------------------------
| MANDT|BUKRS|NETWR |UMSKS|UMSKZ|AUGDT |AUGBL|ZUONR |
---------------------------------------------------------------------------
| 100 |1000 |23.321- | | | | |TEXT I WANT TO KEEP|
| 100 |1000 |0.12 | | | | |TEXT I WANT TO KEEP|
| 100 |1500 |90 | | | | |TEXT I WANT TO KEEP|
---------------------------------------------------------------------------
Expected Outcome
MANDT|BUKRS|NETWR|UMSKS|UMSKZ|AUGDT|AUGBL|ZUONR
100|1000|23.321-|||||TEXT I WANT TO KEEP
100|1000|0.12|||||TEXT I WANT TO KEEP
100|1500|90|||||TEXT I WANT TO KEEP
The code here is what I'm trying to work with, but I need help composing the regular expression. In Notepad++ I can search for \h+(\w+)\h+ and replace with \1, but that doesn't work here. Please help me build a proper regex.
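One likely reason the Notepad++ pattern fails is that Python's `re` module has no `\h` escape; a character class such as `[ \t]` is a common stand-in for horizontal whitespace. The sketch below is one possible regex-based approach under that substitution, not a direct translation of the Notepad++ pattern. The names `SEPARATOR`, `AROUND_PIPE`, `OUTER_PIPE`, and `clean_with_regex` are made up for illustration, and it assumes every dash-only line should be removed.

```python
import re

# A whole line of dashes, including its newline (or the end of the text).
SEPARATOR = re.compile(r"^-+[ \t]*(?:\n|\Z)", re.MULTILINE)
# A pipe together with any spaces or tabs padding it on either side.
AROUND_PIPE = re.compile(r"[ \t]*\|[ \t]*")
# The single pipe at the very start or very end of each line.
OUTER_PIPE = re.compile(r"^\||\|$", re.MULTILINE)


def clean_with_regex(text: str) -> str:
    """Remove separator lines, padding around pipes, and the outer pipes."""
    text = SEPARATOR.sub("", text)
    text = AROUND_PIPE.sub("|", text)
    return OUTER_PIPE.sub("", text)
```

Anchoring on the pipe rather than on `\w+` means empty fields and values like 23.321- are handled the same way as plain words, and spaces inside a field are never touched.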