13

I have a file that has several consecutive spaces between words in some places. I need to clean the file and replace each multi-space sequence with a single space. I wrote the following statement, which does not work at all; it seems I'm making a big mistake.

 $s = preg_replace("/( *)/", " ", $x);

My file is very simple. Here is a part of it:

Hjhajhashsh dwddd dddd sss   ddd wdd ddcdsefe xsddd   scdc yyy5ty    ewewdwdewde           wwwe ddr3r dce eggrg               vgrg fbjb   nnn  bh jfvddffv mnmb   weer ffer3ef f4r4 34t4 rt4t4t 4t4t4t4t    ffrr  rrr  ww w w ee3e iioi   hj   hmm  mmmmm mmjm lk ;’’ kjmm  ,,,, jjj hhh  lmmmlm m mmmm lklmm jlmm m
Burhan Khalid
Mostafa Talebi

3 Answers

34

Your regex replaces any number of spaces (including zero) with a space. You should only replace two or more (after all, replacing a single space with itself is pointless):

$s = preg_replace("/ {2,}/", " ", $x);
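To make the difference concrete, here is a minimal sketch that runs both patterns on a short string. The sample string and the variable names `$broken` and `$fixed` are only illustrative, not taken from the question.

```php
<?php
$x = "sss   ddd wdd   scdc";

// Original pattern: ` *` also matches the empty string, so extra spaces
// are inserted at empty matches (for example at the very start of $x).
$broken = preg_replace('/( *)/', ' ', $x);

// Corrected pattern: only runs of two or more spaces are replaced.
$fixed = preg_replace('/ {2,}/', ' ', $x);

var_dump($broken === $fixed); // the two results differ
echo $fixed, "\n";            // sss ddd wdd scdc
```

Printing `$broken` with `var_dump` is an easy way to see where the unwanted spaces end up.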
Tim Pietzcker
3

What I usually do to clean up multiple spaces is:

while (strpos($x, '  ') !== false) {
   $x = str_replace('  ', ' ', $x);
}

Conditions/hypotheses:

  1. strings with multiple spaces are rare
  2. two spaces are by far more common than three or more
  3. preg_replace is expensive in terms of CPU
  4. copying characters to a new string should be avoided when possible

Of course, if condition #1 is not met, this approach does not make sense, but in practice it usually is met.

If #1 is met, but any of the others is not (this may depend on the data, the software (PHP version) or even the hardware), then the following may be faster:

if (strpos($x, '  ') !== false) {
   $x = preg_replace('/  +/', ' ', $x); // i.e.: '/␣␣+/'
}

Anyway, if multiple spaces appear only in, say, 2% of your strings, the important thing is the preventive check with strpos, and you probably don't care much about optimizing the remaining 2% of cases.
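Since the trade-off depends on the data and the PHP version, it may be worth measuring both variants on your own strings. The following is a rough timing sketch, not a rigorous benchmark: the function names, sample strings and iteration count are all assumptions made for illustration, and it needs PHP 7.0 or later for the type declarations.

```php
<?php
// Hypothetical helper names; they simply wrap the two snippets above.

// Loop variant: repeatedly collapse pairs of spaces until none remain.
function collapse_with_loop(string $x): string
{
    while (strpos($x, '  ') !== false) {
        $x = str_replace('  ', ' ', $x);
    }
    return $x;
}

// Guarded regex variant: call preg_replace only if a double space exists.
function collapse_with_guarded_regex(string $x): string
{
    if (strpos($x, '  ') !== false) {
        $x = preg_replace('/  +/', ' ', $x);
    }
    return $x;
}

// Run $fn on $input $iterations times and return the elapsed seconds.
function time_calls(callable $fn, string $input, int $iterations): float
{
    $start = microtime(true);
    for ($i = 0; $i < $iterations; $i++) {
        $fn($input);
    }
    return microtime(true) - $start;
}

$samples = [
    'no runs'   => 'one two three four five',
    'double'    => 'one  two three  four five',
    'long runs' => 'one' . str_repeat(' ', 15) . 'two' . str_repeat(' ', 15) . 'three',
];

foreach ($samples as $label => $input) {
    $loop  = time_calls('collapse_with_loop', $input, 20000);
    $regex = time_calls('collapse_with_guarded_regex', $input, 20000);
    printf("%-10s loop: %.4fs  guarded regex: %.4fs\n", $label, $loop, $regex);
}
```

The absolute numbers will vary between machines and runs, so compare the two columns within each row rather than across rows.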

Walter Tross
  • PHP's regex engine is highly optimized. You should profile this - I'm willing to bet that this approach will be much slower than a single regex replace. – Tim Pietzcker Mar 05 '14 at 08:42
  • @TimPietzcker: if multiple spaces are rare enough you already lost your bet, because one call to `strpos` is for sure less expensive than one call to `preg_replace` – Walter Tross Mar 05 '14 at 08:44
  • Can you try it on the example string the OP gave? – Tim Pietzcker Mar 05 '14 at 08:47
  • @TimPietzcker, I'm quite sure that my loop is much slower than a single `preg_replace` on the OP's example, mainly because what dominates is the function call overhead, and with runs of 15 spaces, as in this case, the `strpos` is called 5 times and the `str_replace` 4 times. This example is absolutely not realistic, though. – Walter Tross Mar 05 '14 at 08:53
  • @WalterTross "This example is absolutely not realistic, though." - I'm interested to know in what way(s?) the example is _absolutely not realistic_? – Steven Mar 05 '14 at 09:20
  • @Steven: OK, bad wording - _absolutely uncommon_, or something like that. If it is common for the OP, then my hypothesis #1 is not met for him, and he can happily discard my answer. – Walter Tross Mar 05 '14 at 09:28
  • @WalterTross I see where you're coming from and for everyday/typical files (letters, essays, reports, etc.) I agree with you that multiple spaces are going to be rare and are likely to be typos when they do occur; given that tabs are more commonly used for spacing. However, if you were to go through my files (yours too, probably) you'd find an abundance of multiple spaces (debate: tabs vs spaces) most notably in source code... Even in your answer there are likely several instances of multiple spaces? – Steven Mar 05 '14 at 10:07
  • Single space normalization is usually done on strings that are not supposed to be input with multiple spaces – Walter Tross Mar 06 '14 at 10:44
  • 2
    I did some profiling of this solution vs. the preg_replace() solution from @TimPietzcker on a PHP 7.0 system. This solution is nearly identical in duration for strings with 1-2 spaces, but it takes about twice as long for more than 2 spaces. So the preg_replace() solution is preferred. – orrd Jan 25 '17 at 20:14
  • @orrd, this solution is based on the hypotheses stated above. I now clarified hypothesis #1, the most important one, by adding just 2 words: "strings with". In the limit where the probability of strings with multiple spaces tends to 0, my solution simply scans the whole string with a `strpos('  ')`, while the `preg_replace`-only solution certainly does more than that (and may also copy the string, but this is something I'm not sure of). Please note that this solution also has a `preg_replace` version, where the `preg_replace` is enclosed in a `strpos`. – Walter Tross Jan 25 '17 at 21:39
  • Sure, but surprisingly the preg_replace solution seemed to take almost exactly the same amount of time even if the string doesn't have any double spaces. I wouldn't expect that either, but it seemed to be the case. My test strings were fairly short, so I don't know if there would be a difference with a large block of text. – orrd Jan 25 '17 at 23:55
1
// Your input
$str = "Hjhajhashsh dwddd dddd sss   ddd wdd ddcdsefe xsddd   scdc yyy5ty    ewewdwdewde           wwwe ddr3r dce eggrg               vgrg fbjb   nnn  bh jfvddffv mnmb   weer ffer3ef f4r4 34t4 rt4t4t 4t4t4t4t    ffrr  rrr  ww w w ee3e iioi   hj   hmm  mmmmm mmjm lk ;’’ kjmm  ,,,, jjj hhh  lmmmlm m mmmm lklmm jlmm m";
echo $str . '<br>';

// Replace each run of whitespace (spaces, tabs, newlines) with a single space.
$output = preg_replace('!\s+!', ' ', $str);

echo $output;
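Two things about this pattern are worth spelling out. The `!` characters are only delimiters, so `!\s+!` and `/\s+/` are the same pattern. `\s` matches any whitespace, though, so unlike a space-only pattern it also collapses tabs and line breaks. A small sketch, with a made-up sample string and variable names:

```php
<?php
$text = "one  two\t\tthree \n four";

// Any non-alphanumeric, non-backslash, non-whitespace character can delimit
// a PCRE pattern in PHP, so these two calls are equivalent.
$withBang  = preg_replace('!\s+!', ' ', $text);
$withSlash = preg_replace('/\s+/', ' ', $text);

// Space-only pattern: tabs and the newline are left in place.
$spacesOnly = preg_replace('/ {2,}/', ' ', $text);

var_dump($withBang === $withSlash); // bool(true)
echo $withBang, "\n";               // one two three four
var_dump($spacesOnly);              // still contains the tabs and the newline
```

If the file's line breaks must survive, the space-only pattern is the safer choice; if every kind of whitespace should become a single space, `\s+` does that in one call.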
its_me
EKL
  • Can you explain the difference between `!\s+!` and `/\s+/`? I don't understand; they look like they do the same thing. – ElTi-42 May 02 '22 at 14:01