Python multiple regexes cleaning the file

Question

Preface

It has been a while me not using python so I'm having issues with data cleaning. In notepad++ it goes really slow so I am looking for more efficient options in python.

What I need

I need to clean over 100 files in one directory, all of them were extracted manually from SAP.

Steps that I am looking for:

Remove first line with -----
Remove third line with -----
Remove first and last character | from each line
Remove whitespaces where needed - between text I need to keep them

Original File

---------------------------------------------------------------------------
|  MANDT|BUKRS|NETWR     |UMSKS|UMSKZ|AUGDT     |AUGBL|ZUONR              |
---------------------------------------------------------------------------
|  100  |1000 |23.321-   |     |     |          |     |TEXT I WANT TO KEEP|
|  100  |1000 |0.12      |     |     |          |     |TEXT I WANT TO KEEP|
|  100  |1500 |90        |     |     |          |     |TEXT I WANT TO KEEP|
---------------------------------------------------------------------------

Expected Outcome

MANDT|BUKRS|NETWR|UMSKS|UMSKZ|AUGDT|AUGBL|ZUONR
100|1000|23.321-|||||TEXT I WANT TO KEEP
100|1000|0.12|||||TEXT I WANT TO KEEP
100|1500|90|||||TEXT I WANT TO KEEP

The code here is what I'm trying to work with but I need help with regular expression composition. In Notepad++ I can use \h+(\w+)\h+ and as a replace \1 but here it doesn't work. Please help me to build a proper regex.

Ignore lines without `|`, Split on `|`, Trim, Join value with `|`. No regex needed with this algo. — Drag and Drop, Sep 05 '17 at 17:50
Thing is as I mentioned I haven't used python for a while - 1 year or so I think. Can you share some code pieces? — user2433705, Sep 05 '17 at 17:52

RomanPerekhrest · Answer 1 · 2017-09-06T08:16:38.017

3

Two approaches:

-- using built-in str object functions:

with open('yourfile.txt', 'r') as f:
    lines = f.read().splitlines()     # getting list of lines
    for l in lines:
        if not l.startswith('---'):   # skip dashed lines
            print('|'.join(map(str.strip, l.strip('|').split('|'))))

-- using re.sub() function:

with open('yourfile.txt', 'r') as f:
    lines = f.read().splitlines()
    for l in lines:
        if not l.startswith('---'):
            print(re.sub(r'\|\s*|\s*\|', '|', l).strip('|'))
            # an auxiliary pattern for complex cases:
            # re.sub(r'\|\s*(\S*)\s*(?=\|)', '|\\1', l).strip('|')

The output:

MANDT|BUKRS|NETWR|UMSKS|UMSKZ|AUGDT|AUGBL|ZUONR
100|1000|23.321-|||||TEXT I WANT TO KEEP
100|1000|0.12|||||TEXT I WANT TO KEEP
100|1500|90|||||TEXT I WANT TO KEEP

Read/write mode:

To overwrite the current file with new content use the following approach:

with open('yourfile.txt', 'r+') as f:   # 'r+' - read/write mode
    lines = f.read().splitlines()
    f.seek(0)      # reset file pointer
    f.truncate()   # truncating file contents
    for l in lines:
        if not l.startswith('---'):
            # or f.write('|'.join(map(str.strip, l.strip('|').split('|'))) + '\n')
            f.write(re.sub(r'\|\s*|\s*\|', '|', l).strip('|') + '\n')

edited Sep 06 '17 at 08:16

answered Sep 05 '17 at 18:02

RomanPerekhrest

88,541
4
65
105

clever! nice outside the box thinking vs just going down the regex route – sniperd Sep 05 '17 at 18:31
How I can overwrite the file I am using? I don't want to print it. I was trying with l='|'.join(map(str.strip, l.strip('|').split('|'))) but it did not work. – user2433705 Sep 05 '17 at 18:31
@user2433705, see my last approach – RomanPerekhrest Sep 05 '17 at 18:54
@RomanPerekhrest, there is one thing I still need to update - how to modify the regex to remove whitespaces that pop-up before numbers - I didn't include it in the original example. So in my case I have `| 100 |1000 | 23.321- | | | | |TEXT I WANT TO KEEP|` – user2433705 Sep 06 '17 at 07:56
@user2433705, see my update under comment `# an auxiliary pattern for complex cases:` – RomanPerekhrest Sep 06 '17 at 08:17
Can you please guide me how I should change the regex if instead of a pipe I have tab delimited files now? How I can change tabs to pipes? – user2433705 Sep 13 '17 at 21:13

abskmj · Answer 2 · 2017-09-06T14:44:22.827

0

WRT the code snippet in reference, you can use following patterns:

REGEXES = [(re.compile(r'^[-\n]+',re.M), ''),
       (re.compile(r'([\s]+)?\|([\s]+)?'), '|')]

edited Sep 06 '17 at 14:44

answered Sep 05 '17 at 18:20

abskmj

760
3
6

It gives a syntax error on 'gm part - what does it mean? – user2433705 Sep 05 '17 at 18:25
I wanted to use `^` to indicate the start of a line which needs a multi line flag to be set. [Example](https://regex101.com/r/YYgpoN/1). I have updated my answer with correct syntax. – abskmj Sep 06 '17 at 14:41

ctwheels · Answer 3 · 2017-09-05T18:40:17.283

Preface

Assuming all your files are in the same format, you could simply use a regex replace.

Answer

You can view this in use on regex101 here

Explanation

This regex [\t ]|-{2,}\s*|^\||\|$ will:

Catch all tab or space characters
Catch all - characters where two such characters follow each other (as well as any following whitespace characters)
Catch all lines beginning with the | character
Catch all lines ending with the | character

Note that you must ensure global g and multi-line m modifiers are active.

Code

Your final code should resemble the following:

import re

regex = r"[\t ]|-{2,}\s*|^\||\|$"

subst = ""

result = re.sub(regex, subst, test_str, 0, re.MULTILINE)

if result:
    print (result)

Where test_str contains the file's contents (such as below)

---------------------------------------------------------------------------
|  MANDT|BUKRS|NETWR     |UMSKS|UMSKZ|AUGDT     |AUGBL|ZUONR              |
---------------------------------------------------------------------------
|  100  |1000 |23.321-   |     |     |          |     |TEXT I WANT TO KEEP|
|  100  |1000 |0.12      |     |     |          |     |TEXT I WANT TO KEEP|
|  100  |1500 |90        |     |     |          |     |TEXT I WANT TO KEEP|
---------------------------------------------------------------------------

Output

MANDT|BUKRS|NETWR|UMSKS|UMSKZ|AUGDT|AUGBL|ZUONR
100|1000|23.321-|||||TEXTIWANTTOKEEP
100|1000|0.12|||||TEXTIWANTTOKEEP
100|1500|90|||||TEXTIWANTTOKEEP

Edit

Answer

You can view this in use on regex101 here

Explanation

(?:^\|[\t ]*)|(?:[\t ]*\|$)|(?:(?<=\|)[\t ]*)|(?:[\t ]*(?=\|))|(?:-{2,}\s*)

The regex above will:

Catch | (only at the beginning of a line) followed by any number of tab or space characters
Catch any number of tab or space characters followed by | (only at the end of a line)
Catch any number of tab or space characters that follow |
Catch any number of tab or space characters that precede |
Catch all - characters where two such characters follow each other (as well as any following whitespace characters)

Note that you must ensure global g and multi-line m modifiers are active.

Code

Your final code should resemble the following:

import re

regex = r"(?:^\|[\t ]*)|(?:[\t ]*\|$)|(?:(?<=\|)[\t ]*)|(?:[\t ]*(?=\|))|(?:-{2,}\s*)"

subst = ""

result = re.sub(regex, subst, test_str, 0, re.MULTILINE)

if result:
    print (result)

Where test_str contains the file's contents (such as below)

---------------------------------------------------------------------------
|  MANDT|BUKRS|NETWR     |UMSKS|UMSKZ|AUGDT     |AUGBL|ZUONR              |
---------------------------------------------------------------------------
|  100  |1000 |23.321-   |     |     |          |     |TEXT I WANT TO KEEP|
|  100  |1000 |0.12      |     |     |          |     |TEXT I WANT TO KEEP|
|  100  |1500 |90        |     |     |          |     |TEXT I WANT TO KEEP|
---------------------------------------------------------------------------

Output

MANDT|BUKRS|NETWR|UMSKS|UMSKZ|AUGDT|AUGBL|ZUONR
100|1000|23.321-|||||TEXT I WANT TO KEEP
100|1000|0.12|||||TEXT I WANT TO KEEP
100|1500|90|||||TEXT I WANT TO KEEP

Many thanks but how to adjust it so it not removes spaces between text? I have some text attributes there as well where I need to keep spaces. — user2433705, Sep 05 '17 at 18:30

Python multiple regexes cleaning the file

Preface

What I need

Original File

Expected Outcome

3 Answers3

Preface

Answer

Explanation

Code

Output

Edit

Answer

Explanation

Code

Output