The answer to the question "Python remove all whitespace in a string" shows separate ways to remove leading/trailing, duplicated, and all spaces, respectively, from a string in Python. But strip() also removes tabs and newlines, and lstrip() only affects the leading end of the string. The solution using .join(sentence.split()) also appears to remove Unicode whitespace characters.

Suppose I have a list of strings, in this case scraped from a website using Scrapy, like this:

['\n                        \n                    ',
 '\n                        ',
 'Some text',
 ' and some more text\n',
 ' and on another a line some more text',
 '                ']

The newlines preserve the formatting of the text when I use it in other contexts, but all the extra space is a nuisance. How do I remove all the leading, trailing, and duplicated internal spaces while preserving the newline characters (in addition to any \r or \t characters, if there are any)?

The result I want (after I join the individual strings) would then be:

['\n\n\nSome text and some more text\nand on another a line some more text']

No sample code is provided because what I've tried so far is just the suggestions on the page referenced above, which give the results I'm trying to avoid.
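To make the problem concrete, here is a minimal sketch of how two of those suggestions behave. The shortened `parts` list is an assumption standing in for the scraped data above, and the variable names are only illustrative.

```python
# A shortened stand-in for the scraped list (an assumption for illustration).
parts = ['\n    \n  ', '\n   ', 'Some text',
         ' and some more text\n',
         ' and on another a line some more text', '    ']
text = "".join(parts)

# strip() with no argument trims every kind of whitespace at both ends,
# so the leading newlines disappear along with the spaces.
stripped = text.strip()
print(repr(stripped))
# 'Some text and some more text\n and on another a line some more text'

# split() with no argument splits on any whitespace, newlines included,
# so rejoining with a single space flattens everything onto one line.
collapsed = " ".join(text.split())
print(repr(collapsed))
# 'Some text and some more text and on another a line some more text'
```

Neither result keeps the three leading newlines, and the second loses the internal newline as well.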

NFB

2 Answers

In that case str.strip() won't help you, even if you pass " " as the argument: it only removes spaces at the start and end of each string, not the ones inside, and it would also remove the single space before "and".

Instead, use regex to remove 2 or more spaces from your strings:

import re

l = ['\n                        \n                    ',
     '\n                        ',
     'Some text',
     ' and some more text\n',
     ' and on another a line some more text']

# remove every run of 2 or more spaces in each string, then join the pieces
result = "".join([re.sub("  +", "", x) for x in l])

print(repr(result))

prints:

'\n\n\nSome text and some more text\n and on another a line some more text'

EDIT: if we apply the regex to each line, we cannot detect \n in some cases, as you noted. So, the alternate and more complex solution would be to join the strings before applying regex, and apply a more complex regex (note that I changed the test list of strings to add more corner cases):

import re

l = ['\n                        \n                    ',
     '\n                        ',
     'Some text',
     ' and some more text \n',
     '\n and on another a line some more text ']

# join first, so the regex can see spaces that sit next to a newline
result = re.sub("(^ |(?<=\n) |  +| (?=\n)| $)", "", "".join(l))

print(repr(result))

prints:

'\n\n\nSome text and some more text\n\nand on another a line some more text'

There are now 5 cases in the regex, and each match is removed:

  • a single space at the start of the string
  • a single space following a newline
  • a run of 2 or more spaces
  • a single space followed by a newline
  • a single space at the end of the string
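
As a reading aid, the same five alternatives can be spelled out one per line with re.VERBOSE. This is only a sketch: the name `SPACE_CLEANUP` is made up for illustration, and `[ ]` stands for a literal space because verbose mode ignores bare spaces in the pattern.

```python
import re

# Same five alternatives as the one-line regex, annotated case by case.
SPACE_CLEANUP = re.compile(r"""
    ^[ ]          # a single space at the start of the string
  | (?<=\n)[ ]    # a single space right after a newline
  | [ ]{2,}       # a run of two or more spaces
  | [ ](?=\n)     # a single space right before a newline
  | [ ]$          # a single space at the end of the string
""", re.VERBOSE)

print(repr(SPACE_CLEANUP.sub("", " a")))      # 'a'
print(repr(SPACE_CLEANUP.sub("", "a\n b")))   # 'a\nb'
print(repr(SPACE_CLEANUP.sub("", "a    b")))  # 'ab'
print(repr(SPACE_CLEANUP.sub("", "a \nb")))   # 'a\nb'
print(repr(SPACE_CLEANUP.sub("", "a ")))      # 'a'
```

Note the third case: a run of spaces between two words is removed entirely, not reduced to one space.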

Afterthought: this looks (and is) complicated. There is a non-regex solution after all, which gives exactly the same result (if there aren't multiple spaces between words):

result = "\n".join([x.strip(" ") for x in "".join(l).split("\n")])
print(repr(result))

Just join the strings, split on newlines, apply strip with " " as the argument so that tabs are preserved, and join again with newlines.

Chain with re.sub("  +"," ",x.strip(" ")) to collapse any run of spaces between words into a single space:

result = "\n".join([re.sub("  +"," ",x.strip(" ")) for x in "".join(l).split("\n")])
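
Wrapped in a function, the chained version might look like the sketch below. The name `normalize_spaces` and the `sample` list are assumptions made for illustration; the sample includes runs of spaces between words and a tab to show what is collapsed and what is left alone.

```python
import re

def normalize_spaces(parts):
    """Join parts, then trim and collapse plain spaces on each line."""
    lines = "".join(parts).split("\n")
    # strip(" ") trims only spaces, so tabs and carriage returns survive;
    # re.sub then turns any run of 2+ spaces into exactly one space.
    return "\n".join(re.sub("  +", " ", line.strip(" ")) for line in lines)

sample = ['\n      \n    ', 'Some   text', ' and\tmore \n', '  last   line  ']
print(repr(normalize_spaces(sample)))
# '\n\nSome text and\tmore\nlast line'
```

Here the runs of spaces between words become one space instead of vanishing, and the tab is untouched.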
Jean-François Fabre
  • This is helpful and almost completely answers it - except that if a single space begins a new line, as in the example here, it doesn't get stripped. – NFB Jun 28 '17 at 19:58
  • @NFB: now the answer looks pretty complete and there are even 2 alternatives now :) – Jean-François Fabre Jun 28 '17 at 20:16
  • This completely solves the original example, however it does have one unintended effect not illustrated by my example string. If there are multiple spaces between words, it reduces them to no space at all, rather than just one space. However I'm marking it answered since it does solve the original example. Also a great range of possible solutions. Thank you! – NFB Jun 29 '17 at 02:20
  • Unless I am missing something, I am still getting double spaces removed, not replaced with single spaces. Which code snippet was fixed to replace many spaces by one, the longer or second (shorter) one? – NFB Jun 29 '17 at 22:06

You can also do the whole thing with built-in string operations if you like.

l = ['\n                        \n                    ',
     '\n                        ',
     'Some text',
     ' and some more text\n',
     ' and on another a      line some more text',
     '              ']


def remove_duplicate_spaces(line):
    # split(' ') yields '' for every extra space; dropping those collapses runs
    words = [w for w in line.split(' ') if w != '']
    return ' '.join(words)

lines = ''.join(l).split('\n')
formatted_lines = map(remove_duplicate_spaces, lines)
u = "\n".join(formatted_lines)

print(repr(u))

gives

'\n\n\nSome text and some more text\nand on another a line some more text'

You can also collapse the whole thing into a one-liner:

s = '\n'.join([' '.join([s for s in x.strip(' ').split(' ') if s!='']) for x in ''.join(l).split('\n')])

# OR

t = '\n'.join(map(lambda x: ' '.join(filter(lambda s: s!='', x.strip(' ').split(' '))), ''.join(l).split('\n')))
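
One detail worth spelling out is why these snippets call split(' ') with an explicit space instead of a bare split(). The short sketch below, using a made-up `line`, shows the difference when a tab is present.

```python
line = "  a \t b   c  "

# Bare split() treats tabs (and newlines) as separators, so the tab is lost.
any_whitespace = " ".join(line.split())
print(repr(any_whitespace))   # 'a b c'

# split(' ') splits on spaces only; filtering out the empty strings it
# produces collapses runs of spaces while keeping the tab as a token.
spaces_only = " ".join(w for w in line.split(" ") if w)
print(repr(spaces_only))      # 'a \t b c'
```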
jpeoples