1

This question has been asked before, but the fast answers that I have seen also remove the trailing spaces, which I don't want.

"   a     bc    "

should become

" a bc "

I have

text = re.sub(' +', " ", text)

but am hoping for something faster. The suggestion that I have seen (and which won't work) is

' '.join(text.split())

Note that I will be doing this to lots of smaller texts so just checking for a trailing space won't be so great.

Andomar
user984003
  • 1
    If you really want to optimize stuff like this, use C, not Python. Try Cython, which is pretty much Python syntax but as fast as C. – Has QUIT--Anony-Mousse Jun 13 '13 at 15:13
  • 1
    You could try `''.join((text[0],' '.join(text[1:-1].split()),text[-1]))` but that is probably not faster than the regex (you'd need to timeit), and it's definitely not easier to read. – mgilson Jun 13 '13 at 15:14
  • Have you checked that this is really the thing slowing down your program? My (very uninformed) guess is that it is not. First profile, and then if performance really is an issue, then optimise (and the easiest way to do that might be to rewrite the critical bits in C). – Adrian Ratnapala Jun 13 '13 at 15:16
  • Why do you want something faster? I doubt it's really affecting your program. – Lanaru Jun 13 '13 at 15:18
  • You could compile your regex before running, that would make it a bit faster. – Jonas Byström Jun 13 '13 at 15:18
  • 1
    See http://stackoverflow.com/questions/1546226/the-shortest-way-to-remove-multiple-spaces-in-a-string-in-python. The winner seems to be `while '  ' in s: s=s.replace('  ', ' ')` – Fredrik Pihl Jun 13 '13 at 15:19
  • @FredrikPihl If you still have time, suggest editing comment to link directly to answer: http://stackoverflow.com/a/15913564/1988505 – Wesley Baugh Jun 13 '13 at 15:22
  • Too much time has passed to edit, so I added it here instead: http://stackoverflow.com/a/15913564/297323 – Fredrik Pihl Jun 13 '13 at 15:25

3 Answers

2

If you really want to optimize stuff like this, use C, not Python.

Try Cython, which is pretty much Python syntax but as fast as C.

Here is some stuff you can time:

import array
buf=array.array('c')
input="   a     bc    "
space=False
for c in input:
  if not space or not c == ' ': buf.append(c)
  space = (c == ' ')
buf.tostring()

Also try using cStringIO:

import cStringIO
buf=cStringIO.StringIO()
input="   a     bc    "
space=False
for c in input:
  if not space or not c == ' ': buf.write(c)
  space = (c == ' ')
buf.getvalue()

But again, if you want to make such things really fast, don't do it in Python. Use Cython. The two approaches I gave here will likely be slower than the regex, just because they put much more work on the Python interpreter. If you want these things to be fast, do as little as possible in Python. The `for c in input` loop likely already kills all theoretical performance of the approaches above.
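The two snippets above are Python 2 code (`array.array('c')` and `cStringIO` are not available in Python 3). A rough Python 3 sketch of the same character-by-character idea might look like the following; the function names `collapse_with_list` and `collapse_with_stringio` are made up for illustration, and this is not expected to beat the regex.

```python
import io


def collapse_with_list(text):
    """Collapse runs of spaces by collecting kept characters in a list."""
    out = []
    prev_space = False
    for ch in text:
        # Skip a space only when the previous character was also a space.
        if not (prev_space and ch == " "):
            out.append(ch)
        prev_space = ch == " "
    return "".join(out)


def collapse_with_stringio(text):
    """Same loop, writing into io.StringIO (Python 3 stand-in for cStringIO)."""
    buf = io.StringIO()
    prev_space = False
    for ch in text:
        if not (prev_space and ch == " "):
            buf.write(ch)
        prev_space = ch == " "
    return buf.getvalue()
```

Both functions keep a single leading and trailing space, which is the behaviour the question asks for.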

Has QUIT--Anony-Mousse
2

FWIW, some timings

$  python -m timeit -s 's="   a     bc    "' 't=s[:]' "while '  ' in t: t=t.replace('  ', ' ')"
1000000 loops, best of 3: 1.05 usec per loop

$ python -m timeit -s 'import re;s="   a     bc    "'  "re.sub(' +', ' ', s)"
100000 loops, best of 3: 2.27 usec per loop

$ python -m timeit -s 's=" a bc "' "''.join((s[0],' '.join(s[1:-1].split()),s[-1]))"
1000000 loops, best of 3: 0.592 usec per loop

$ python -m timeit -s 'import re;s="   a     bc    "'  "re.sub(' {2,}', ' ', s)"
100000 loops, best of 3: 2.34 usec per loop

$ python -m timeit -s 's="   a     bc    "' '" "+" ".join(s.split())+" "'
1000000 loops, best of 3: 0.387 usec per loop
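For a comparison on longer input, something along these lines could be used. It is only a sketch: the helper names (`with_replace_loop`, `with_regex`, `with_split_join`, `compare`) are invented here, the 100x repetition is an arbitrary choice, and the numbers will vary by machine and Python version, so no results are shown.

```python
import re
import timeit

# Precompiled pattern: two or more consecutive spaces.
MULTI_SPACE = re.compile(r" {2,}")


def with_replace_loop(s):
    # Repeatedly halve runs of spaces until no double space remains.
    while "  " in s:
        s = s.replace("  ", " ")
    return s


def with_regex(s):
    return MULTI_SPACE.sub(" ", s)


def with_split_join(s):
    # Checks the edges instead of assuming they are spaces.
    # Assumes s contains at least one non-space character and no
    # whitespace other than spaces (split() also splits on tabs/newlines).
    lead = " " if s[:1] == " " else ""
    trail = " " if s[-1:] == " " else ""
    return lead + " ".join(s.split()) + trail


def compare(sample, number=1000):
    """Return total seconds per approach for `number` calls on `sample`."""
    results = {}
    for fn in (with_replace_loop, with_regex, with_split_join):
        results[fn.__name__] = timeit.timeit(lambda: fn(sample), number=number)
    return results


if __name__ == "__main__":
    sample = "   a     bc    " * 100
    for name, seconds in compare(sample).items():
        print(f"{name}: {seconds:.4f} s")
```

Unlike the last two command-line timings, `with_split_join` here does not "know" in advance that the string starts and ends with a space.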
Fredrik Pihl
  • `re.sub(' {2,}', ...` would be a fairer test. There's no point in matching a single space. – Aya Jun 13 '13 at 15:27
  • @Aya -- Good suggestion, for me, that does about 30% better for this simple test. – mgilson Jun 13 '13 at 15:28
  • I also timed my suggestion ... It comes in between the other two on my desktop: `python -m timeit -s 's=" a bc "' "s = ''.join((s[0],' '.join(s[1:-1].split()),s[-1]))"` – mgilson Jun 13 '13 at 15:29
  • Using regex is always slower than using a direct approach. Only use regex when there is no other simpler way or if code readability (although regex can sometimes be complicated to understand) is more important than speed. – Samy Arous Jun 13 '13 at 15:37
  • 1
    @lcfseth It would depend on the length of the string, and the number of multi-space instances. For longer strings with many multi-space instances, the regex would out-perform the `str.replace` approach. – Aya Jun 13 '13 at 15:42
  • @Aya The replace approach is not nearly optimal (it's not even linear). But take the approach presented by Anony-Mousse. I might be wrong but there is no way a regex will out-perform it. – Samy Arous Jun 13 '13 at 15:55
  • @lcfseth For a sufficiently long string, I'd be willing to bet that `re.sub()` would beat either of Anony-Mousse's Python-based examples. – Aya Jun 13 '13 at 16:01
  • 1
    With this trivial string the `while`-approach beats the re even with `s = "..."*10000` – Fredrik Pihl Jun 13 '13 at 16:21
  • @JanneKarila - glad someone is paying attention :-) See update. Should probably write a proper timing comparison but I believe this question is rather dead now... – Fredrik Pihl Jun 14 '13 at 09:24
  • Or try the regexp: `  +` (i.e. two spaces followed by `+`). The third and last ones are not fair, because they "magically know" there was a leading and a trailing space. I'd love to see the two approaches I suggested in your benchmark (although I don't really expect them to win - too much Python code involved) – Has QUIT--Anony-Mousse Jun 14 '13 at 09:32
  • Oh, and try on a longer string. With the toy example from the question results may be inaccurate. For example the first method likely is the fastest when there are no duplicate spaces in the string. I'd love to see if 100x the toy string concatenated changes the results a lot. – Has QUIT--Anony-Mousse Jun 14 '13 at 09:36
  • I will, parental leave would be boring without SO :-) – Fredrik Pihl Jun 14 '13 at 09:38
0

Just a small rewrite of the suggestion up there, but just because something has a small fault doesn't mean you should assume it won't work.

You could easily do something like:

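# Booleans act as 0/1, so " " * True == " " and " " * False == "".
front_space = lambda x: x[:1] == " "      # slicing avoids IndexError on ""
trailing_space = lambda x: x[-1:] == " "
" " * front_space(text) + ' '.join(text.split()) + " " * trailing_space(text)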
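Wrapped up as a function, the same idea could look like the sketch below. The name `collapse_spaces` is invented for this example, and the handling of the empty and all-space cases is an assumption about what the asker would want (it matches what `re.sub(' +', ' ', text)` returns for those inputs). Note that `split()` also treats tabs and newlines as separators, so this is only equivalent to the regex when spaces are the only whitespace.

```python
def collapse_spaces(text):
    """Collapse runs of spaces, keeping one leading/trailing space if present."""
    if not text:
        return text
    core = " ".join(text.split())
    if not core:
        # Input was nothing but spaces: reduce it to a single space
        # rather than emitting one for each edge.
        return " "
    lead = " " if text[0] == " " else ""
    trail = " " if text[-1] == " " else ""
    return lead + core + trail


print(repr(collapse_spaces("   a     bc    ")))  # ' a bc '
```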
Slater Victoroff