Substitute multiple whitespace with single whitespace in Python

Question

I have this string:

mystring = 'Here is  some   text   I      wrote   '

How can I substitute the double, triple (...) whitespace characters with a single space, so that I get:

mystring = 'Here is some text I wrote'

You should probably say 'substitute multiple whitespace with a single *space*' since whitespace is a class of characters (tabs, newlines etc.) — Noufal Ibrahim, Jan 16 '10 at 16:15

score 993 · Accepted Answer · answered Jan 16 '10 at 15:54

A simple possibility (if you'd rather avoid REs) is

' '.join(mystring.split())

The split and join perform the task you're explicitly asking about -- plus, they also do the extra one that you don't talk about but is seen in your example, removing trailing spaces;-).
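As a quick illustration of the above, here is a minimal sketch that runs the split/join idiom on the string from the question and on a second, made-up string (`messy` is only an illustrative name) that contains tabs and newlines:

```python
# split() with no arguments splits on runs of any whitespace and
# discards leading/trailing whitespace, so join() gets clean words.
mystring = 'Here is  some   text   I      wrote   '
result = ' '.join(mystring.split())
print(repr(result))  # 'Here is some text I wrote'

# Hypothetical second input: tabs and newlines are collapsed too.
messy = '  tabs\tand\nnewlines  too '
cleaned = ' '.join(messy.split())
print(repr(cleaned))  # 'tabs and newlines too'
```

Note that the second case shows the side effect mentioned in the comments: every kind of whitespace, not only the space character, ends up as a single space.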

Alex Martelli

Oh cool, I was fumbling with a similar solution, but using split(' ') and then a filter to remove empty elements. I never knew split with no arguments worked like this. This is also much faster, timeit.py gives me around 0.74usec for this, versus 5.75usec for regular expressions. – Roman Jan 16 '10 at 16:00

@Roman, yes, `x.split()` (and `x.split(None)`) splits on _sequences of whitespace_ (including tabs, newlines, etc, like re's `\s`) of length 1+ -- and it's pretty fast indeed. So, always glad to help! – Alex Martelli Jan 16 '10 at 16:25

This is a very elegant solution, but I want to mention that it will also remove any line breaks – trudolf Aug 24 '15 at 00:26
`str.split` also considers various characters (x0b, x0c, x1c, x1d, x1e, x1f) to be whitespace, and sometimes this is not intended. – Asclepius Feb 10 '20 at 01:30
Cleanest solution by far, and according to my tests it is slightly (and somewhat unsurprisingly) faster than using a regex. It doesn't fit some specific situations, such as those in the comments above, but you don't need to import a module to do the job, which is probably one of the reasons it is "slightly" faster (by 3 to 5 ms). – ivanleoncz Apr 10 '20 at 15:53

To avoid '\n' from being mixed with ' ' one can use splitlines() like this: ' '.join((''.join(text.splitlines())).split()) – Pradeep Singh Aug 25 '20 at 17:28

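To only strip consecutive repeated spaces one can use `' '.join(filter(None, mystring.split(' ')))`. The `filter(None, ...)` is needed because `split(' ')` yields an empty string for every extra space, and joining those back would restore the original spacing. This will also remove the leading and trailing spaces but will keep newlines, tabs, etc. – FifthAxiom Jun 11 '22 at 12:05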
Does `split()` match the same white space characters as `\s`? – dreamflasher Jan 18 '23 at 15:37
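Regarding the line-break concern raised in the comments above, one possible approach (a sketch, not something proposed in the thread) is to apply split/join to each line separately. The helper name `collapse_per_line` is hypothetical:

```python
def collapse_per_line(text):
    """Collapse whitespace runs inside each line, keeping line breaks.

    Assumption: normalising every line ending to '\n' is acceptable,
    since splitlines() also splits on '\r\n' and a few other separators.
    """
    return '\n'.join(' '.join(line.split()) for line in text.splitlines())


sample = 'a  b \n  c\t\td'
print(repr(collapse_per_line(sample)))  # 'a b\nc d'
```

Each line is trimmed and collapsed on its own, so the number of lines in the text is preserved.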

score 193 · Answer 2 · edited Mar 11 '20 at 09:14

A regular expression can be used to offer more control over the whitespace characters that are combined.

To match unicode whitespace:

import re

_RE_COMBINE_WHITESPACE = re.compile(r"\s+")

my_str = _RE_COMBINE_WHITESPACE.sub(" ", my_str).strip()

To match ASCII whitespace only:

import re

_RE_COMBINE_WHITESPACE = re.compile(r"(?a:\s+)")
_RE_STRIP_WHITESPACE = re.compile(r"(?a:^\s+|\s+$)")

my_str = _RE_COMBINE_WHITESPACE.sub(" ", my_str)
my_str = _RE_STRIP_WHITESPACE.sub("", my_str)

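Matching only ASCII whitespace is sometimes essential for keeping control characters such as \x1c, \x1d, \x1e, \x1f. Note that \x0b and \x0c are \v and \f, which are part of [ \t\n\r\f\v] and are therefore still matched in ASCII mode.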
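To see which characters each mode treats as whitespace, here is a small sketch. The sample string and the names `UNICODE_WS` and `ASCII_WS` are made up for illustration, and the `re.ASCII` flag is used in place of the inline `(?a:...)` form:

```python
import re

UNICODE_WS = re.compile(r"\s+")          # default for str patterns
ASCII_WS = re.compile(r"\s+", re.ASCII)  # only [ \t\n\r\f\v]

# \x1c is a control character, \xa0 a non-breaking space, \x0b is \v.
sample = "a\x1cb\xa0c\x0bd  e"

unicode_result = UNICODE_WS.sub(" ", sample)
ascii_result = ASCII_WS.sub(" ", sample)

print(repr(unicode_result))  # 'a b c d e'
print(repr(ascii_result))    # 'a\x1cb\xa0c d e'
```

In ASCII mode \x1c and the non-breaking space survive, while \x0b is still replaced because it is one of the six ASCII whitespace characters.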

Reference:

About \s:

For Unicode (str) patterns: Matches Unicode whitespace characters (which includes [ \t\n\r\f\v], and also many other characters, for example the non-breaking spaces mandated by typography rules in many languages). If the ASCII flag is used, only [ \t\n\r\f\v] is matched.

About re.ASCII:

Make \w, \W, \b, \B, \d, \D, \s and \S perform ASCII-only matching instead of full Unicode matching. This is only meaningful for Unicode patterns, and is ignored for byte patterns. Corresponds to the inline flag (?a).

strip() will remove any leading and trailing whitespace.

If you really only want to replace spaces (' '), use `re.sub(' +', ' ', mystring).strip()` — Simon Hessner, Jul 16 '18 at 13:14
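A minimal sketch of the space-only variant from the comment above, using a made-up sample string. One detail worth noting: `strip()` with no argument also trims newlines and tabs at the ends, whereas `strip(' ')` trims only spaces:

```python
import re

text = "  one   two\t\tthree  \n  four  \n"

# Collapse runs of the space character only; tabs and newlines stay.
collapsed = re.sub(" +", " ", text)

trimmed_all = collapsed.strip()        # trims any whitespace at the ends
trimmed_spaces = collapsed.strip(" ")  # trims only spaces at the ends

print(repr(trimmed_all))     # 'one two\t\tthree \n four'
print(repr(trimmed_spaces))  # 'one two\t\tthree \n four \n'
```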

score 47 · Answer 3 · edited May 23 '17 at 11:47

For completeness, you can also use:

# the while loop will leave a trailing space, so the trailing whitespace
# must be dealt with before or after the while loop
mystring = mystring.strip()
while '  ' in mystring:
    mystring = mystring.replace('  ', ' ')

which will work quickly on strings with relatively few spaces (faster than re in these situations).

In any scenario, Alex Martelli's split/join solution performs at least as quickly (usually significantly more so).

In your example, using the default values of timeit.Timer.repeat(), I get the following times:

str.replace: [1.4317800167340238, 1.4174888149192384, 1.4163512401715934]
re.sub:      [3.741931446594549,  3.8389395858970374, 3.973777672860706]
split/join:  [0.6530919432498195, 0.6252146571700905, 0.6346594329726258]
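For anyone who wants to reproduce a comparison like this, here is a rough sketch using `timeit.repeat`. The function names and the `run_benchmark` helper are hypothetical, the iteration count is an assumption (smaller than timeit's default of 1,000,000 so it finishes quickly), and the absolute numbers will differ between machines and Python versions:

```python
import re
import timeit

mystring = 'Here is  some   text   I      wrote   '
_WS = re.compile(r'\s+')  # precompiled so compilation is not timed


def with_split_join(s):
    return ' '.join(s.split())


def with_regex(s):
    return _WS.sub(' ', s).strip()


def with_replace_loop(s):
    s = s.strip()
    while '  ' in s:
        s = s.replace('  ', ' ')
    return s


def run_benchmark(number=100_000, repeat=3):
    """Print the best of `repeat` timings for each approach."""
    for func in (with_replace_loop, with_regex, with_split_join):
        times = timeit.repeat(lambda: func(mystring),
                              number=number, repeat=repeat)
        print(f'{func.__name__:<18} {min(times):.4f}s')


if __name__ == '__main__':
    run_benchmark()
```

All three functions return the same result for the string from the question, so the comparison is like for like on that input. They are not equivalent in general, since the replace loop only touches the space character.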

EDIT:

Just came across this post which provides a rather long comparison of the speeds of these methods.

More lines than the others, and thus less "pythonic", but clearer. — BuvinJ, Feb 11 '16 at 20:02
A reminder, this one has the risk of being infinite loop if you typo. — 林果皞, Jun 24 '20 at 14:23
