
I have short strings like this:

$str = 'abc | xx ??   "1 x \' 3" d e f \' y " 5 \' x yz';

I want to remove all spaces from a string that are not enclosed in single or double quotes. Any characters enclosed in single or double quotes should not be changed. As a result, I expect:

$expected =  'abc|xx??"1 x \' 3"def\' y " 5 \'xyz';

My current solution based on character-wise comparisons is the following:

function removeSpaces($string){
  $ret = $stop = "";
  for($i=0; $i < strlen($string);$i++){
    $char = $string[$i];
    if($stop == "") {
      if($char == " ") continue;
      if($char == "'" OR $char == '"') $stop = $char;
    }
    else {
      if($char == $stop) $stop = "";
    }
    $ret .= $char;
  }
  return $ret;
}

Is there a solution that is smarter?

Nigel Ren
jspit
  • `d e f \' y` is enclosed by `"`, what's the rule here? You find the first quote and then, if/when you find the next quote, you start to remove whitespace? – Felippe Duarte Oct 23 '20 at 17:20
  • The handling is the same as with the PHP interpreter. Single quotes enclosed in double quotes and double quotes enclosed in single quotes are treated like any other characters. – jspit Oct 23 '20 at 17:55
  • @Thefourthbird Please never post answers as comments. – mickmackusa Oct 26 '20 at 02:50

3 Answers

2

You can use

preg_replace('~(?<!\\\\)(?:\\\\{2})*(?:"[^"\\\\]*(?:\\\\.[^"\\\\]*)*"|\'[^\'\\\\]*(?:\\\\.[^\'\\\\]*)*\')(*SKIP)(?!)|\s+~s', '', $str)

See the PHP demo and a regex demo.

Details

  • (?<!\\)(?:\\{2})* - ensures the following quote is not escaped: matches any number of backslash pairs that are not preceded by another \
  • (?:"[^"\\]*(?:\\.[^"\\]*)*"|'[^'\\]*(?:\\.[^'\\]*)*') - either a double- or single-quoted string literal allowing escape sequences
  • (*SKIP)(?!) - skip the match and start a new search from the location where the regex failed
  • | - or
  • \s+ - 1 or more whitespaces.

Note that in a single-quoted PHP string literal a backslash starts an escape sequence, so a literal backslash is written as \\. The regex engine in turn needs two backslashes to match one literal backslash in the text, which is why the pattern contains "\\\\".
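As a quick check, the pattern can be wrapped in a small helper and run against the sample string from the question. This is only a sketch: the function name stripUnquotedWhitespace is made up for illustration, and falling back to the input when preg_replace() returns null is an assumption about the desired error handling.

```php
<?php
// Hypothetical helper name; the pattern itself is the one from this answer.
function stripUnquotedWhitespace(string $str): string
{
    $pattern = '~(?<!\\\\)(?:\\\\{2})*(?:"[^"\\\\]*(?:\\\\.[^"\\\\]*)*"|\'[^\'\\\\]*(?:\\\\.[^\'\\\\]*)*\')(*SKIP)(?!)|\s+~s';

    // preg_replace() returns null on a PCRE error; assumption: return the input unchanged then.
    return preg_replace($pattern, '', $str) ?? $str;
}

$str      = 'abc | xx ??   "1 x \' 3" d e f \' y " 5 \' x yz';
$expected = 'abc|xx??"1 x \' 3"def\' y " 5 \'xyz';

var_dump(stripUnquotedWhitespace($str) === $expected); // bool(true) for the sample string
```

Since the second alternative is \s+, tabs and line breaks outside quotes are removed as well, not only spaces.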

Wiktor Stribiżew
  • Works perfectly for me. Was surprised that this is possible with a regular expression. Thank you very much. – jspit Oct 24 '20 at 16:22
1

You could capture either " or ' in a group and consume any escaped variants of each until encountering the matching closing ' or " using a backreference \1

(?<!\\)(['"])(?:(?!(?:\1|\\)).|\\.)*+\1(*SKIP)(*FAIL)|\h+

Regex demo | Php demo

Explanation

  • (?<!\\) Negative lookbehind, assert not a \ directly to the left
  • (['"]) capture group 1, match either ' or "
  • (?: Non capture group
    • (?!(?:\1|\\)). Negative lookahead: if what is directly to the right is neither the value of group 1 nor a backslash, match any char except a newline
    • | Or
    • \\. Match an escaped character
  • )*+ Close non capture group and repeat 0+ times (possessive, so no backtracking into the group)
  • \1 Backreference to what is captured in group 1 (match up either ' or ")
  • (*SKIP)(*FAIL) Skip the match until now. Read more about (*SKIP)(*FAIL)
  • | Or
  • \h+ Match 1+ horizontal whitespace chars that you want to remove

As @Wiktor Stribiżew points out in his comment

In some rare situations, this might match at a wrong position, namely, if there is a literal backslash (not an escaping one) before a single/double quoted string that should be skipped. You need to add (?:\\{2})* after (?<!\\)

The pattern would then be:

(?<!\\)(?:\\{2})*(['"])(?:(?!(?:\1|\\)).|\\.)*+\1(*SKIP)(*FAIL)|\h+

Regex demo
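To try the amended pattern from PHP, a minimal sketch could look like the following. The nowdoc is just one way to keep the regex exactly as written above without adding PHP-level escaping; the variable names and the ~ delimiter are my choice for illustration.

```php
<?php
// Nowdoc: the pattern is taken literally, so no extra backslash escaping is needed.
$pattern = <<<'REGEX'
~(?<!\\)(?:\\{2})*(['"])(?:(?!(?:\1|\\)).|\\.)*+\1(*SKIP)(*FAIL)|\h+~
REGEX;

$str      = 'abc | xx ??   "1 x \' 3" d e f \' y " 5 \' x yz';
$expected = 'abc|xx??"1 x \' 3"def\' y " 5 \'xyz';

// Quoted sections are skipped; runs of horizontal whitespace elsewhere are removed.
$result = preg_replace($pattern, '', $str);

var_dump($result === $expected); // bool(true) for the sample string
```

Because this variant uses \h+ and no s flag, line breaks are never removed, and a quoted section that spans several lines is not recognised as one.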

The fourth bird
  • Works perfectly for me too. I accept this solution because the regular expression is a bit shorter. Thank you. – jspit Oct 24 '20 at 16:26
  • @jspit In some rare situations, this might match at a wrong position, namely, if there is a literal backslash (not an escaping one) before a single/double quoted string that should be skipped. You need to add `(?:\\{2})*` after `(?<!\\)` – Wiktor Stribiżew Oct 24 '20 at 17:12
  • @WiktorStribiżew That is a good point, in that case your pattern should be the accepted answer. +1 – The fourth bird Oct 24 '20 at 17:25
  • Both regular expressions fail with this string for me: "a\\'a b'" My string solution works fine for it. But I can rule out this rare situation for my application. Therefore I can use your solutions without any problems. – jspit Oct 24 '20 at 21:27
0

Here is a 3 step approach:

  1. replace spaces in quote sections with placeholder
  2. remove all spaces
  3. restore spaces in quote sections
    $str = 'abc | xx ??   "1 x \' 3" d e f \' y " 5 \' x yz';
    echo 'input:  ' . $str . "\n";
    $result = preg_replace_callback( // replace spaces in quote sections with placeholder
        '|(["\'])(.*?)(\1)|',
        function ($matches) {
            $s = preg_replace('/ /', "\x01", $matches[2]);
            return $matches[1] . $s . $matches[3];
        },
        $str
    );
    $result = preg_replace('/ /', '', $result);     // remove all spaces
    $result = preg_replace('/\x01/', ' ', $result); // restore spaces in quote sections
    echo 'result: ' . $result;
    echo "\nexpect: " . 'abc|xx??"1 x \' 3"def\' y " 5 \'xyz';

Output:

input:  abc | xx ??   "1 x ' 3" d e f ' y " 5 ' x yz
result: abc|xx??"1 x ' 3"def' y " 5 'xyz
expect: abc|xx??"1 x ' 3"def' y " 5 'xyz

Explanation:

  1. replace spaces in quote sections with placeholder
  • use preg_replace_callback()
  • '|(["\'])(.*?)(\1)|' matches quote sections starting and ending with either " or '
  • the (\1) makes sure the closing quote is the same character as the opening one (either " or ')
  • within the callback, use preg_replace() to replace all spaces with the non-printable placeholder "\x01"
  2. remove all spaces
  • use preg_replace() to remove all spaces
  • this pattern does not match the placeholder "\x01", so the spaces inside quote sections are left alone
  3. restore spaces in quote sections
  • use preg_replace() to turn every placeholder "\x01" back into a space
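The same three steps can also be packed into one function. The sketch below is a variation, not the code from this answer: it swaps str_replace() in for the literal single-character replacements, the function name is hypothetical, and it assumes the input never contains the byte "\x01" and has no backslash-escaped quotes.

```php
<?php
// Hypothetical function name; same three steps, literal replacements done with str_replace().
function removeSpacesThreeStep(string $str): string
{
    // Step 1: protect spaces inside quoted sections with the placeholder "\x01".
    $protected = preg_replace_callback(
        '|(["\'])(.*?)\1|',
        static function (array $m): string {
            return $m[1] . str_replace(' ', "\x01", $m[2]) . $m[1];
        },
        $str
    );
    if ($protected === null) {
        throw new RuntimeException('preg_replace_callback() failed');
    }

    // Step 2: remove the remaining (unquoted) spaces.
    $stripped = str_replace(' ', '', $protected);

    // Step 3: turn the placeholders back into spaces.
    return str_replace("\x01", ' ', $stripped);
}

$str      = 'abc | xx ??   "1 x \' 3" d e f \' y " 5 \' x yz';
$expected = 'abc|xx??"1 x \' 3"def\' y " 5 \'xyz';

var_dump(removeSpacesThreeStep($str) === $expected); // bool(true) for the sample string
```

This keeps a single regex call for finding the quoted sections and leaves the rest to plain string functions.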
mickmackusa
Peter Thoeny
  • I would not use a `preg_` that requires another `preg_` call to be nested inside it ...and also needs two more `preg_` calls to mop up at the end. In what way do you think your new answer adds value to this page? – mickmackusa Oct 24 '20 at 03:57
  • @mickmackusa: Because there is usually more than one answer to a question / I think it's good to give options. Let the reader decide what is useful and what is not. – Peter Thoeny Oct 24 '20 at 05:38
  • But if you already know that your answer is less efficient (because it makes multiple `preg_` calls and the earlier answers only make 1 `preg_` call), then your answer will only waste researchers' time while they read an answer that provides no benefit. So my question again to you is: What value does this less efficient, less direct, more verbose answer have over the other answers? Does it somehow provide improved accuracy? I don't know because I didn't test any of the answers, but I can virtually put money on Wiktor's answers being solid. – mickmackusa Oct 24 '20 at 05:45
  • @mickmackusa: I agree, that for this particular use case the one line regex is easier, and likely faster. Value for this answer: Learning and education. There are use cases where multi-pass regexes make sense to solve complex problems effectively, such as parsing nested structures: https://twiki.org/cgi-bin/view/Blog/BlogEntry201109x3 – Peter Thoeny Oct 25 '20 at 21:35