1

Preface

It has been a while me not using python so I'm having issues with data cleaning. In notepad++ it goes really slow so I am looking for more efficient options in python.

What I need

I need to clean over 100 files in one directory, all of them were extracted manually from SAP.

Steps that I am looking for:

  1. Remove first line with -----
  2. Remove third line with -----
  3. Remove first and last character | from each line
  4. Remove whitespaces where needed - between text I need to keep them

Original File

---------------------------------------------------------------------------
|  MANDT|BUKRS|NETWR     |UMSKS|UMSKZ|AUGDT     |AUGBL|ZUONR              |
---------------------------------------------------------------------------
|  100  |1000 |23.321-   |     |     |          |     |TEXT I WANT TO KEEP|
|  100  |1000 |0.12      |     |     |          |     |TEXT I WANT TO KEEP|
|  100  |1500 |90        |     |     |          |     |TEXT I WANT TO KEEP|
---------------------------------------------------------------------------

Expected Outcome

MANDT|BUKRS|NETWR|UMSKS|UMSKZ|AUGDT|AUGBL|ZUONR
100|1000|23.321-|||||TEXT I WANT TO KEEP
100|1000|0.12|||||TEXT I WANT TO KEEP
100|1500|90|||||TEXT I WANT TO KEEP

The code here is what I'm trying to work with but I need help with regular expression composition. In Notepad++ I can use \h+(\w+)\h+ and as a replace \1 but here it doesn't work. Please help me to build a proper regex.

ctwheels
  • 21,901
  • 9
  • 42
  • 77
user2433705
  • 141
  • 1
  • 10

3 Answers3

3

Two approaches:

-- using built-in str object functions:

with open('yourfile.txt', 'r') as f:
    lines = f.read().splitlines()     # getting list of lines
    for l in lines:
        if not l.startswith('---'):   # skip dashed lines
            print('|'.join(map(str.strip, l.strip('|').split('|'))))

-- using re.sub() function:

with open('yourfile.txt', 'r') as f:
    lines = f.read().splitlines()
    for l in lines:
        if not l.startswith('---'):
            print(re.sub(r'\|\s*|\s*\|', '|', l).strip('|'))
            # an auxiliary pattern for complex cases:
            # re.sub(r'\|\s*(\S*)\s*(?=\|)', '|\\1', l).strip('|')

The output:

MANDT|BUKRS|NETWR|UMSKS|UMSKZ|AUGDT|AUGBL|ZUONR
100|1000|23.321-|||||TEXT I WANT TO KEEP
100|1000|0.12|||||TEXT I WANT TO KEEP
100|1500|90|||||TEXT I WANT TO KEEP

Read/write mode:

To overwrite the current file with new content use the following approach:

with open('yourfile.txt', 'r+') as f:   # 'r+' - read/write mode
    lines = f.read().splitlines()
    f.seek(0)      # reset file pointer
    f.truncate()   # truncating file contents
    for l in lines:
        if not l.startswith('---'):
            # or f.write('|'.join(map(str.strip, l.strip('|').split('|'))) + '\n')
            f.write(re.sub(r'\|\s*|\s*\|', '|', l).strip('|') + '\n')
RomanPerekhrest
  • 88,541
  • 4
  • 65
  • 105
  • clever! nice outside the box thinking vs just going down the regex route – sniperd Sep 05 '17 at 18:31
  • How I can overwrite the file I am using? I don't want to print it. I was trying with l='|'.join(map(str.strip, l.strip('|').split('|'))) but it did not work. – user2433705 Sep 05 '17 at 18:31
  • @user2433705, see my last approach – RomanPerekhrest Sep 05 '17 at 18:54
  • @RomanPerekhrest, there is one thing I still need to update - how to modify the regex to remove whitespaces that pop-up before numbers - I didn't include it in the original example. So in my case I have `| 100 |1000 | 23.321- | | | | |TEXT I WANT TO KEEP|` – user2433705 Sep 06 '17 at 07:56
  • @user2433705, see my update under comment `# an auxiliary pattern for complex cases:` – RomanPerekhrest Sep 06 '17 at 08:17
  • Can you please guide me how I should change the regex if instead of a pipe I have tab delimited files now? How I can change tabs to pipes? – user2433705 Sep 13 '17 at 21:13
0

WRT the code snippet in reference, you can use following patterns:

REGEXES = [(re.compile(r'^[-\n]+',re.M), ''),
       (re.compile(r'([\s]+)?\|([\s]+)?'), '|')]
abskmj
  • 760
  • 3
  • 6
  • It gives a syntax error on 'gm part - what does it mean? – user2433705 Sep 05 '17 at 18:25
  • I wanted to use `^` to indicate the start of a line which needs a multi line flag to be set. [Example](https://regex101.com/r/YYgpoN/1). I have updated my answer with correct syntax. – abskmj Sep 06 '17 at 14:41
0

Preface

Assuming all your files are in the same format, you could simply use a regex replace.

Answer

You can view this in use on regex101 here

Explanation

This regex [\t ]|-{2,}\s*|^\||\|$ will:

  1. Catch all tab or space characters
  2. Catch all - characters where two such characters follow each other (as well as any following whitespace characters)
  3. Catch all lines beginning with the | character
  4. Catch all lines ending with the | character

Note that you must ensure global g and multi-line m modifiers are active.


Code

Your final code should resemble the following:

import re

regex = r"[\t ]|-{2,}\s*|^\||\|$"

subst = ""

result = re.sub(regex, subst, test_str, 0, re.MULTILINE)

if result:
    print (result)

Where test_str contains the file's contents (such as below)

---------------------------------------------------------------------------
|  MANDT|BUKRS|NETWR     |UMSKS|UMSKZ|AUGDT     |AUGBL|ZUONR              |
---------------------------------------------------------------------------
|  100  |1000 |23.321-   |     |     |          |     |TEXT I WANT TO KEEP|
|  100  |1000 |0.12      |     |     |          |     |TEXT I WANT TO KEEP|
|  100  |1500 |90        |     |     |          |     |TEXT I WANT TO KEEP|
---------------------------------------------------------------------------

Output

MANDT|BUKRS|NETWR|UMSKS|UMSKZ|AUGDT|AUGBL|ZUONR
100|1000|23.321-|||||TEXTIWANTTOKEEP
100|1000|0.12|||||TEXTIWANTTOKEEP
100|1500|90|||||TEXTIWANTTOKEEP

Edit

Answer

You can view this in use on regex101 here

Explanation

(?:^\|[\t ]*)|(?:[\t ]*\|$)|(?:(?<=\|)[\t ]*)|(?:[\t ]*(?=\|))|(?:-{2,}\s*)

The regex above will:

  1. Catch | (only at the beginning of a line) followed by any number of tab or space characters
  2. Catch any number of tab or space characters followed by | (only at the end of a line)
  3. Catch any number of tab or space characters that follow |
  4. Catch any number of tab or space characters that precede |
  5. Catch all - characters where two such characters follow each other (as well as any following whitespace characters)

Note that you must ensure global g and multi-line m modifiers are active.


Code

Your final code should resemble the following:

import re

regex = r"(?:^\|[\t ]*)|(?:[\t ]*\|$)|(?:(?<=\|)[\t ]*)|(?:[\t ]*(?=\|))|(?:-{2,}\s*)"

subst = ""

result = re.sub(regex, subst, test_str, 0, re.MULTILINE)

if result:
    print (result)

Where test_str contains the file's contents (such as below)

---------------------------------------------------------------------------
|  MANDT|BUKRS|NETWR     |UMSKS|UMSKZ|AUGDT     |AUGBL|ZUONR              |
---------------------------------------------------------------------------
|  100  |1000 |23.321-   |     |     |          |     |TEXT I WANT TO KEEP|
|  100  |1000 |0.12      |     |     |          |     |TEXT I WANT TO KEEP|
|  100  |1500 |90        |     |     |          |     |TEXT I WANT TO KEEP|
---------------------------------------------------------------------------

Output

MANDT|BUKRS|NETWR|UMSKS|UMSKZ|AUGDT|AUGBL|ZUONR
100|1000|23.321-|||||TEXT I WANT TO KEEP
100|1000|0.12|||||TEXT I WANT TO KEEP
100|1500|90|||||TEXT I WANT TO KEEP
ctwheels
  • 21,901
  • 9
  • 42
  • 77
  • Many thanks but how to adjust it so it not removes spaces between text? I have some text attributes there as well where I need to keep spaces. – user2433705 Sep 05 '17 at 18:30