Removing white space from txt with python

Question

I have a .txt file (scraped as pre-formatted text from a website) where the data looks like this:

B, NICKOLAS                       CT144531X       D1026    JUDGE ANNIE WHITE JOHNSON  
ANDREWS VS BALL                   JA-15-0050      D0015    JUDGE EDWARD A ROBERTS

I'd like to remove all extra spaces (they're actually different number of spaces, not tabs) in between the columns. I'd also then like to replace it with some delimiter (tab or pipe since there's commas within the data), like so:

ANDREWS VS BALL|JA-15-0050|D0015|JUDGE EDWARD A ROBERTS

Looked around and found that the best options are using regex or shlex to split. Two similar scenarios:

Is there always more than one space between the data columns? — OneCricketeer, Apr 30 '16 at 17:25
You shouldn't use regex, because the data might have 2 or more spaces in a column. The data is in a text-table format, sp columns should be determined by the position of a character on each line. — Jonathan Vanasco, May 01 '16 at 16:45

Ahsanul Haque · Answer 1 · 2016-08-28T14:36:18.090

What about this?

your_string ='ANDREWS VS BALL                   JA-15-0050      D0015    JUDGE EDWARD A ROBERTS'
print re.sub(r'\s{2,}','|',your_string.strip())

Output:

ANDREWS VS BALL|JA-15-0050|D0015|JUDGE EDWARD A ROBERTS

Expanation:

I've used re.sub() which takes 3 parameter, a pattern, a string you want to replace with and the string you want to work on.

What I've done is taking at least two space together , I 've replaced them with a | and applied it on your string.

score 7 · Answer 2 · answered Apr 30 '16 at 17:33

You can apply the regex '\s{2,}' (two or more whitespace characters) to each line and substitute the matches with a single '|' character.

>>> import re
>>> line = 'ANDREWS VS BALL                   JA-15-0050      D0015    JUDGE EDWARD A ROBERTS        '
>>> re.sub('\s{2,}', '|', line.strip())
'ANDREWS VS BALL|JA-15-0050|D0015|JUDGE EDWARD A ROBERTS'

Stripping any leading and trailing whitespace from the line before applying re.sub ensures that you won't get '|' characters at the start and end of the line.

Your actual code should look similar to this:

import re
with open(filename) as f:
    for line in f:
        subbed = re.sub('\s{2,}', '|', line.strip())
        # do something here

Thank you, really nicely explained. Just needed to add a linebreak function. — aysha, Apr 30 '16 at 20:08

VelikiiNehochuha · Accepted Answer · 2016-04-30T18:53:53.637

5

s = """B, NICKOLAS                       CT144531X       D1026    JUDGE ANNIE WHITE JOHNSON  
ANDREWS VS BALL                   JA-15-0050      D0015    JUDGE EDWARD A ROBERTS
"""

# Update
re.sub(r"(\S)\ {2,}(\S)(\n?)", r"\1|\2\3", s)
In [71]: print re.sub(r"(\S)\ {2,}(\S)(\n?)", r"\1|\2\3", s)
B, NICKOLAS|CT144531X|D1026|JUDGE ANNIE WHITE JOHNSON  
ANDREWS VS BALL|JA-15-0050|D0015|JUDGE EDWARD A ROBERTS

edited Apr 30 '16 at 18:53

answered Apr 30 '16 at 17:48

VelikiiNehochuha

3,775
2
15
32

It appends an extra pipe at the end. – Sнаđошƒаӽ Apr 30 '16 at 18:09
Perfect! Clean and simple, plus it adds line breaks nicely. – aysha Apr 30 '16 at 20:07

Sнаđошƒаӽ · Answer 4 · 2016-04-30T17:58:12.657

3

Considering there are at least two spaces separating the columns, you can use this:

lines = [
'B, NICKOLAS                       CT144531X       D1026    JUDGE ANNIE WHITE JOHNSON  ',
'ANDREWS VS BALL                   JA-15-0050      D0015    JUDGE EDWARD A ROBERTS        '
]

for line in lines:
    parts = []
    for part in line.split('  '):
        part = part.strip()
        if part:  # checking if stripped part is a non-empty string
            parts.append(part)
    print('|'.join(parts))

Output for your input:

B, NICKOLAS|CT144531X|D1026|JUDGE ANNIE WHITE JOHNSON
ANDREWS VS BALL|JA-15-0050|D0015|JUDGE EDWARD A ROBERTS

edited Apr 30 '16 at 17:58

answered Apr 30 '16 at 17:27

Sнаđошƒаӽ

16,753
12
73
90

1

Did your test it before posting as an answer? – Ahsanul Haque Apr 30 '16 at 17:29
How can this give `ANDREWS VS BALL|JA-15-0050|D0015|JUDGE EDWARD A ROBERTS`? – Peter Wood Apr 30 '16 at 17:31
care to explain the downvote? @PeterWood Edited code gives the required output. – Sнаđошƒаӽ Apr 30 '16 at 17:33
1

Basically you are creating a list with a whole bunch of `''` and skipping them with an if statement. Besides, `part.strip()` is called twice, once for comparing and once for appending two the list. IMHO, shorter code doesn't mean faster. – Ahsanul Haque Apr 30 '16 at 17:44
@AhsanulHaque where did I say that it is good solution?? And what's the point of becoming so offensive?? You have done yours with re, but I did w/o re, but it is still a solution. – Sнаđошƒаӽ Apr 30 '16 at 17:45
@Sнаđошƒаӽ Sorry that you are hurt,I didn't really mean to. But you could see the points in updated comment. I think, it's logical. After optimising them, your answer could be great. – Ahsanul Haque Apr 30 '16 at 17:51

Jonathan Vanasco · Answer 5 · 2016-05-01T16:44:15.813

It looks like your data is in a "text-table" format.

I recommend using the first row to figure out the start point and length of each column (either by hand or write a script with regex to determine the likely columns), then writing a script to iterate the rows of the file, slice the row into column segments, and apply strip to each segment.

If you use a regex, you must keep track of the number of columns and raise an error if any given row has more than the expected number of columns (or a different number than the rest). Splitting on two-or-more spaces will break if a column's value has two-or-more spaces, which is not just entirely possible, but also likely. Text-tables like this aren't designed to be split on a regex, they're designed to be split on the column index positions.

In terms of saving the data, you can use the csv module to write/read into a csv file. That will let you handle quoting and escaping characters better than specifying a delimiter. If one of your columns has a | character as a value, unless you're encoding the data with a strategy that handles escapes or quoted literals, your output will break on read.

Parsing the text above would look something like this (i nested a list comprehension with brackets instead of the traditional format so it's easier to understand):

cols = ((0,34),
        (34, 50),
        (50, 59),
        (59, None),
        )
for line in lines:
    cleaned = [i.strip() for i in [line[s:e] for (s, e) in cols]]
    print cleaned

then you can write it with something like:

import csv
with open('output.csv', 'wb') as csvfile:
    spamwriter = csv.writer(csvfile, delimiter='|',
                            quotechar='"', quoting=csv.QUOTE_MINIMAL)
    for line in lines:
        spamwriter.writerow([line[col_start:col_end].strip()
                             for (col_start, col_end) in cols
                             ])

score 0 · Answer 6 · answered May 12 '16 at 21:13

0

Looks like this library can solve this quite nicely: http://docs.astropy.org/en/stable/io/ascii/fixed_width_gallery.html#fixed-width-gallery

Impressive...

answered May 12 '16 at 21:13

Fruch

408
5
18

Removing white space from txt with python

6 Answers6