2

Introduction

My general issue is that I want to replace question marks in a string, but only when they are not inside quotes. I found a similar answer on SO (link) and began testing the code. Unfortunately, the code does not take escaped quotes into account.

For example: $string = 'hello="is it me your are looking for\\"?" AND test=?';

I have adapted a regular expression and code from that answer to the question: How to replace words outside double and single quotes, which is reproduced here for ease of reading my question:

<?php
function str_replace_outside_quotes($replace,$with,$string){
    $result = "";
    $outside = preg_split('/("[^"]*"|\'[^\']*\')/',$string,-1,PREG_SPLIT_DELIM_CAPTURE);
    while ($outside)
        $result .= str_replace($replace,$with,array_shift($outside)).array_shift($outside);
    return $result;
}
?>

Actual issue

So I have attempted to adjust the pattern to allow for it to match anything that is not a quote " and quotes that are escaped \":

<?php
$pattern = '/("(\\"|[^"])*"' . '|' . "'[^']*')/";

// when parsed/echoed by PHP the pattern evaluates to
// /("(\"|[^"])*"|'[^']*')/
?>

But this does not work as I had hoped.

My test string is: hello="is it me your are looking for\"?" AND test=?

And I am getting the following matches:

array
  0 => string 'hello=' (length=6)
  1 => string '"is it me your are looking for\"?"' (length=34)
  2 => string '?' (length=1)
  3 => string ' AND test=?' (length=11)

Match index two should not be there. That question mark should be considered part of match index 1 only and not repeated separately.

Once resolved, the same fix should also apply to the other side of the main alternation, which handles single quotes/apostrophes (').

After this is parsed by the complete function it should output:

echo str_replace_outside_quotes('?', '%s', 'hello="is it me your are looking for\\"?" AND test=?');
// hello="is it me your are looking for\"?" AND test=%s

I hope that this makes sense and I have provided enough information to answer the question. If not I will happily provide whatever you need.

Debug code

My current (complete) code sample is on codepad for forking as well:

function str_replace_outside_quotes($replace, $with, $string){
    $result = '';
    var_dump($string);
    $pattern = '/("(\\"|[^"])*"' . '|' . "'[^']*')/";
    var_dump($pattern);
    $outside = preg_split($pattern, $string, -1, PREG_SPLIT_DELIM_CAPTURE);
    var_dump($outside);
    while ($outside) {
        $result .= str_replace($replace, $with, array_shift($outside)) . array_shift($outside);
    }
    return $result;
}
echo str_replace_outside_quotes('?', '%s', 'hello="is it me your are looking for\\"?" AND test=?');

Sample input and expected output

In: hello="is it me your are looking for\\"?" AND test=? AND hello='is it me your are looking for\\'?' AND test=? hello="is it me your are looking for\\"?" AND test=?' AND hello='is it me your are looking for\\'?' AND test=?
Out: hello="is it me your are looking for\\"?" AND test=%s AND hello='is it me your are looking for\\'?' AND test=%s hello="is it me your are looking for\\"?" AND test=%s AND hello='is it me your are looking for\\'?' AND test=%s

In: my_var = ? AND var_test = "phoned?" AND story = 'he said \'where is it?!?\''
Out: my_var = %s AND var_test = "phoned?" AND story = 'he said \'where is it?!?\''
Community
Treffynnon
  • why does this *have* to be done with regular expressions? why not just create a simple loop: `for($i=0;$i – cegfault Nov 13 '12 at 12:39
  • Short answer. It doesn't. Its just gotta work :) – Treffynnon Nov 13 '12 at 12:41
  • I think you should write a simple parser: read char by char and adjust accordingly. – Luca Rainone Nov 13 '12 at 12:48
  • @chumkiu that was my original plan, but had hoped this would be easier and quicker. – Treffynnon Nov 13 '12 at 12:52
  • Please see [my answer to a nearly identical question](http://stackoverflow.com/a/5696141/433790) – ridgerunner Nov 13 '12 at 16:21
  • @ridgerunner thanks for the link. I have had a quick look and I cannot see how to make this work when it can be either double or singly quoted. It seems to answer the question in isolation. Would you be able to reference your previous answer in an answer here? Would be great to cut down the complexity of the regex, but it is outside my current regex knowledge. – Treffynnon Nov 13 '12 at 16:34
  • Yes I'm working on an answer for you. Will your strings contain other commonly escaped characters such as `\n`, `\t`, etc? – ridgerunner Nov 13 '12 at 18:21
  • @ridgerunner Thanks for taking a look. Yes, it may contain other commonly escaped characters. It is for parametised SQL statements (with some hard coded values as well), which for reasons outside of my control I must process this way rather than passing into PDO for this particular application. – Treffynnon Nov 13 '12 at 18:25
  • It would really help if you provided some example input and expected output for the `str_replace_outside_quotes()` function. It is not clear what the function is supposed to do. I am assuming from its name that you want to replace all instances of: `$replace` with: `$with` that occur within `$string`, but only those instances that do NOT occur within single or double quoted substrings, (which themselves may contain escaped quotes). Yes? – ridgerunner Nov 13 '12 at 18:34
  • @ridgerunner that is exactly it. – Treffynnon Nov 13 '12 at 18:37
  • @ridgerunner I have edited my question with a set of sample input and sample output strings. – Treffynnon Nov 13 '12 at 18:41
  • Ok, I've thrown my solution into the ring. – ridgerunner Nov 14 '12 at 00:31
  • I also put new answer. It doesnt use regex(well it uses for replacing - I think it is better), but I think it easier to understand and it should work(if there is not some small error) with all your needs – Igor Nov 14 '12 at 12:38

5 Answers

3

The following tested script first checks that a given string is valid, consisting solely of single quoted, double quoted and un-quoted chunks. The $re_valid regex performs this validation task. If the string is valid, it then parses the string one chunk at a time using preg_replace_callback() and the $re_parse regex. The callback function processes the unquoted chunks using preg_replace(), and returns all quoted chunks unaltered. The only tricky part of the logic is passing the $replace and $with argument values from the main function to the callback function. (Note that PHP procedural code makes this variable passing from the main function to the callback function a bit awkward.) Here is the script:

<?php // test.php Rev:20121113_1500
function str_replace_outside_quotes($replace, $with, $string){
    $re_valid = '/
        # Validate string having embedded quoted substrings.
        ^                           # Anchor to start of string.
        (?:                         # Zero or more string chunks.
          "[^"\\\\]*(?:\\\\.[^"\\\\]*)*"  # Either a double quoted chunk,
        | \'[^\'\\\\]*(?:\\\\.[^\'\\\\]*)*\'  # or a single quoted chunk,
        | [^\'"\\\\]+               # or an unquoted chunk (no escapes).
        )*                          # Zero or more string chunks.
        \z                          # Anchor to end of string.
        /sx';
    if (!preg_match($re_valid, $string)) // Exit if string is invalid.
        exit("Error! String not valid.");
    $re_parse = '/
        # Match one chunk of a valid string having embedded quoted substrings.
          (                         # Either $1: Quoted chunk.
            "[^"\\\\]*(?:\\\\.[^"\\\\]*)*"  # Either a double quoted chunk,
          | \'[^\'\\\\]*(?:\\\\.[^\'\\\\]*)*\'  # or a single quoted chunk.
          )                         # End $1: Quoted chunk.
        | ([^\'"\\\\]+)             # or $2: an unquoted chunk (no escapes).
        /sx';
    _cb(null, $replace, $with); // Pass args to callback func.
    return preg_replace_callback($re_parse, '_cb', $string);
}
function _cb($matches, $replace = null, $with = null) {
    // Only set local static vars on first call.
    static $_replace, $_with;
    if (!isset($matches)) { 
        $_replace = $replace;
        $_with = $with;
        return; // First call is done.
    }
    // Return quoted string chunks (in group $1) unaltered.
    if ($matches[1]) return $matches[1];
    // Process only unquoted chunks (in group $2).
    return preg_replace('/'. preg_quote($_replace, '/') .'/',
        $_with, $matches[2]);
}
$data = file_get_contents('testdata.txt');
$output = str_replace_outside_quotes('?', '%s', $data);
file_put_contents('testdata_out.txt', $output);
?>
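
As an aside on the argument-passing awkwardness mentioned above: on PHP 5.3 or later, an anonymous function can capture `$replace` and `$with` directly, which avoids the static-variable priming call. The following is only a minimal sketch under stated assumptions, not the tested script above: the function name `replace_outside_quotes_closure` is hypothetical, the `$re_valid` validation step is omitted for brevity, and `str_replace()` is swapped in for `preg_replace()` plus `preg_quote()` so that both the search term and the replacement are treated literally.

```php
<?php
// Hypothetical closure-based variant (assumes PHP 5.3+).
// NOTE: the validation step from the script above is omitted here, so
// text that matches neither alternative (e.g. an unterminated quote)
// is simply left untouched by preg_replace_callback().
function replace_outside_quotes_closure($replace, $with, $string) {
    $re_parse = '/
          (                                      # $1: quoted chunk
            "[^"\\\\]*(?:\\\\.[^"\\\\]*)*"       # double quoted chunk,
          | \'[^\'\\\\]*(?:\\\\.[^\'\\\\]*)*\'   # or single quoted chunk
          )
        | ([^\'"\\\\]+)                          # $2: unquoted chunk
        /sx';
    return preg_replace_callback(
        $re_parse,
        function ($m) use ($replace, $with) {
            // $m[2] is only present when the unquoted alternative matched;
            // quoted chunks (group 1) are returned unaltered.
            return isset($m[2]) ? str_replace($replace, $with, $m[2]) : $m[1];
        },
        $string
    );
}

echo replace_outside_quotes_closure('?', '%s', 'a=? AND b="?"'), "\n";
// a=%s AND b="?"
```

Capturing the arguments with `use` keeps the callback free of shared state, so the function can be called repeatedly with different search terms without any set-up call.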
ridgerunner
  • This looks like a very neat solution to the problem and I especially like the validation before operating on the string to prevent erroneous conversions. Given the complexity I am thinking that I will add a small check to the function so it can use `str_replace()` if no quote marks are found in the supplied string. Thank you for the commented explanations as well. Regular expressions are something that I need to improve upon as once they get more complex I start getting a bit lost so comments are very helpful. – Treffynnon Nov 14 '12 at 12:55
  • The only thing I thought of was that the `preg_quote` could be executed only once if it were moved inside the `if (!isset($matches)) {` statement. In this way it is set at "source" rather than re-evaluated each time a replacement is called. Despite the performance hit I have turned this into a little `MyString` class eg. `MyString::str_replace_outside_quotes($search, $replace, $subject);`. In this way I have actually also removed this memoisation altogether. – Treffynnon Nov 14 '12 at 15:27
  • @ridgerunner I have since [committed](https://github.com/j4mie/idiorm/commit/318d7cdd5ccd2d686cbd6915ff3f58486248c02c) this (attributed to you) into [Idiorm](https://github.com/j4mie/idiorm) and tagged it as the v1.2.0 release. Thanks for your help. Idiorm is a lightweight nearly-zero-configuration object-relational mapper and fluent query builder for PHP5. – Treffynnon Nov 14 '12 at 17:22
  • Not sure who downvoted this but an explanation would be appreciated. – ridgerunner Nov 14 '12 at 19:21
  • @ridgerunner I have my theories on who downvoted your answer. Not me. I've had similar experiences in several answers with one guy around here. – Carlos Nov 15 '12 at 07:36
  • @jackflash it seems I'm getting this, too ;). +1 @ridgerunner! – Martin Ender Nov 21 '12 at 12:49
  • Out of curiosity, is there an intentional reason why there's two different patterns accounting for double quoted vs single quoted sub-strings? Would this not work just as easily with less code? `(?<quote>['"]).*(?<!\\)\k{quote}(*SKIP)(*F)|\?` I'm deriving this from an [answer I posted in another question](https://stackoverflow.com/a/31949494/3257871). – Erutan409 Aug 31 '17 at 15:19
  • @Erutan409 - Yes, efficiency and accuracy. A complete answer to your question requires an in-depth understanding of how an NFA regex engine works. I highly recommend reading "Mastering Regular Expressions" by Jeffrey Friedl. Your regex has several problems - try running it on the following test string and you will see two of the ways your regex can fail: `$data = '?stuff? \'first single ? quoted string\' ?stuff? "double ? quoted string ending with an escaped-escape \\\\" \'second single ? quoted string\' ?stuff?';`. Note also that the length of a regex is not indicative of its speed. – ridgerunner Sep 03 '17 at 17:37
  • I appreciate your elaborate response and respect the technical content of it; albeit somewhat condescending. I have read that book and I am aware that shortness of an expression isn't necessarily indicative of its efficiency. Either way, there is redundancy in your expression that could be reduced a little with a minor performance hit for the application of what the OP was asking for. Thank you for explanation, though. – Erutan409 Sep 09 '17 at 19:04
  • @Erutan409 - I am sorry if I came across as being condescending - that was not my intention - (I am frequently guilty of that). Did you test your regex on the test string I provided? That string illustrates two errors in your regex with regards to _accuracy_ (it's not just about efficiency). Did you figure out the errors? The first is that the greedy `.*` does not correctly match the closing quote when there is more than one quoted string in the subject string. The second is that the negative lookbehind does not work when the last char in a quoted string is an escaped escape (`\\`). – ridgerunner Sep 10 '17 at 11:25
2

This regex matches valid quoted strings. This means it is aware of escaped quotes.

^("[^\"\\]*(?:\\.[^\"\\]*)*(?![^\\]\\)")|('[^\'\\]*(?:\\.[^\'\\]*)*(?![^\\]\\)')$

Ready for PHP use:

$pattern = '/^((?:"([^"\\\\]*(?:\\\\.[^"\\\\]*)*(?![^\\\\]\\\\))")|(?:\'([^\'\\\\]*(?:\\\\.[^\'\\\\]*)*(?![^\\\\]\\\\))\'))$/';

Adapted for str_replace_outside_quotes():

$pattern = '/((?:"(?:[^"\\\\]*(?:\\\\.[^"\\\\]*)*(?![^\\\\]\\\\))")|(?:\'(?:[^\'\\\\]*(?:\\\\.[^\'\\\\]*)*(?![^\\\\]\\\\))\'))/';
Carlos
  • @Treffynnon you'll need to add some backslashes in order to get it working wrapped in quotes. Let me do that for you – Carlos Nov 13 '12 at 12:48
  • This regex experiences [catastrophic backtracking](http://www.regular-expressions.info/catastrophic.html) when confronted with certain subject strings which do not match, e.g. `"this string does not match\"`. – ridgerunner Nov 13 '12 at 18:16
  • @ridgerunner That string is not a valid quoted string because it has not ending quote (it is escaped), thus it is a quoted string up to the infinite. That's the reason why the regex doesn't match. – Carlos Nov 14 '12 at 07:36
  • That's right, it's invalid. But a robust regex solution should complete its job quickly (either matching or not-matching) on _any string_. Your regex _almost_ correctly implements [Friedl's](http://www.amazon.com/Mastering-Regular-Expressions-Jeffrey-Friedl/dp/0596528124 "Mastering Regular Expressions") `{delim normal* (special normal*)* delim}` construct, but it makes the `special` part optional, which is a no-no. (Hint: change both the: `(?:\\.)*` expressions to just: `\\.` and you will be on your way to fixing this.) – ridgerunner Nov 14 '12 at 15:32
  • Not quite. To correctly implement the _"Unrolling-the-Loop"_ construct mentioned, the "special" subexpression needs to atomically match at least one character, and must not be able to match at the same position as the "normal" subexpression. In your regex the "special" subexpression, `[^"\\]*`, with its `*` quantifier, can match zero characters. See: [Mastering Regular Expressions (3rd Edition)](http://www.amazon.com/Mastering-Regular-Expressions-Jeffrey-Friedl/dp/0596528124 "By Jeffrey Friedl. Best book on Regex - ever!") for all the details on implementing this efficiency technique. – ridgerunner Nov 15 '12 at 16:29
1

» Code has been updated to solve ALL issues brought in comments and is now working properly «


Given $s as the input string, $p as the search phrase and $v as the replacement value, use preg_replace as follows:

$r = '/\G((?:(?:[^\x5C"\']|\x5C(?!["\'])|\x5C["\'])*?(?:\'(?:[^\x5C\']|\x5C(?!\')' .
     '|\x5C\')*\')*(?:"(?:[^\x5C"]|\x5C(?!")|\x5C")*")*)*?)' . preg_quote($p, '/') . '/';
$s = preg_match($r, $s) ? preg_replace($r, "$1" . $v, $s) : $s;

Check this demo.


Note: In regex, \x5C represents a \ character.

Ωmega
  • This works well for double quotes, but not at all for single quotes: http://ideone.com/l4el1U – Treffynnon Nov 13 '12 at 18:38
  • Thanks for that. I guess the only thing is that this regex only replaces `?` and doesn't allow the search term be specified at run time like the original function allows for. Using [`preg_quote()`](http://www.php.net/manual/en/function.preg-quote.php) won't work here because the search string appears in character classes. – Treffynnon Nov 13 '12 at 21:23
  • This solution destroys the string if there is no match. – ridgerunner Nov 14 '12 at 00:29
  • Erroneously converts `stuff "?" stuff` to `stuff "%s" stuff`. – ridgerunner Nov 14 '12 at 01:42
  • **UPDATE:** Answer has been updated with solution that solve all issues in above comments. Code and demo is now **working properly**. – Ωmega Nov 14 '12 at 18:12
-1

Edit: changed answer. This version does not use regex for the parsing. Regex is only used for the replacement itself (I thought it would be better to use preg_replace instead of str_replace, but you can change that):

function replace_special($what, $with, $str) {
   $res = '';
   $currPos = 0;
   $doWork = true;

   while (true) {
     $doWork = false; //pessimistic approach

     $pos = get_quote_pos($str, $currPos, $quoteType);
     if ($pos !== false) {
       $posEnd = get_specific_quote_pos($str, $quoteType, $pos + 1);
       if ($posEnd !== false) {
           $doWork = $posEnd !== strlen($str) - 1; //do not break if not end of string reached

           $res .= preg_replace($what, $with, 
                                substr($str, $currPos, $pos - $currPos));
           $res .= substr($str, $pos, $posEnd - $pos + 1);                      

           $currPos = $posEnd + 1;
       }
     }

     if (!$doWork) {
        $res .= preg_replace($what, $with, 
                             substr($str, $currPos, strlen($str) - $currPos + 1));
        break;
     }

   }   

   return $res;
}

function get_quote_pos($str, $currPos, &$type) {
   $pos1 = get_specific_quote_pos($str, '"', $currPos);
   $pos2 = get_specific_quote_pos($str, "'", $currPos);
   if ($pos1 !== false) {
      if ($pos2 !== false && $pos1 > $pos2) {
        $type = "'";
        return $pos2;
      }
      $type = '"';
      return $pos1;
   }
   else if ($pos2 !== false) {
      $type = "'";
      return $pos2;
   }

   return false;
}

function get_specific_quote_pos($str, $type, $currPos) {
   $pos = $currPos - 1; //because $fromPos = $pos + 1 and initial $fromPos must be currPos
   do {
     $fromPos = $pos + 1;
     $pos = strpos($str, $type, $fromPos);
   }
   //iterate again if quote is escaped!
   while ($pos !== false && $pos > $currPos && $str[$pos-1] == '\\');
   return $pos;
}

Example:

   $str = 'hello ? ="is it me your are looking for\\"?" AND mist="???" WHERE test=? AND dzo=?';
   echo replace_special('/\?/', '#', $str);

returns

hello # ="is it me your are looking for\"?" AND mist="???" WHERE test=# AND dzo=#

----

--old answer (I leave it here because it does solve something, although not the full question)

<?php
function str_replace_outside_quotes($replace, $with, $string){
    $result = '';
    var_dump($string);
    $pattern = '/(?<!\\\\)"/';
    $outside = preg_split($pattern, $string, -1, PREG_SPLIT_DELIM_CAPTURE);
   var_dump($outside);
    for ($i = 0; $i < count($outside); ++$i) {
       $replaced = str_replace($replace, $with, $outside[$i]);
       if ($i != 0 && $i != count($outside) - 1) { //first and last are not inside quote
          $replaced = '"'.$replaced.'"';
       }
       $result .= $replaced;
    }
   return $result;
}
echo str_replace_outside_quotes('?', '%s', 'hello="is it me your are looking for\\"?" AND test=?');
Igor
  • This looks nice, but I am getting the quotes stripped out of the output: `hello=is it me your are looking for\"? AND test=%s` – Treffynnon Nov 13 '12 at 13:33
  • I am not sure how this will work with strings that are quoted with apostrophes either `'`. – Treffynnon Nov 13 '12 at 13:35
  • I made change, so it will work for ", but yes it wont work with single quote. I didnt realize it's requirement also sorry :) – Igor Nov 13 '12 at 13:41
-1

As @ridgerunner mentions in the comments on the question there is another possible regex solution:

function str_replace_outside_quotes($replace, $with, $string){
    $result = '';
    $pattern = '/("[^"\\\\]*(?:\\\\.[^"\\\\]*)*")' // hunt down unescaped double quotes
             . "|('[^'\\\\]*(?:\\\\.[^'\\\\]*)*')/s"; // or single quotes
    $outside = array_filter(preg_split($pattern, $string, -1, PREG_SPLIT_DELIM_CAPTURE));
    while ($outside) {
        $result .= str_replace($replace, $with, array_shift($outside)) // outside quotes
                .  array_shift($outside); // inside quotes
    }
    return $result;
}

Note the use of array_filter to remove some matches that were coming back from the regex empty and breaking the alternating nature of this function.
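
One way to sidestep the empty-match problem, sketched here under assumptions rather than taken from the code above, is to wrap both alternatives in a single capture group. With only one group, `preg_split()` with `PREG_SPLIT_DELIM_CAPTURE` returns the pieces in strict outside, inside, outside order (using empty strings for empty outside pieces), so no filtering is needed and adjacent or leading quoted chunks do not break the alternation. The function name `replace_outside_quotes_split` is hypothetical.

```php
<?php
// Hypothetical variant: ONE capture group around both quote styles, so
// preg_split() always yields outside, inside, outside, ... in that order.
function replace_outside_quotes_split($replace, $with, $string) {
    $pattern = '/("[^"\\\\]*(?:\\\\.[^"\\\\]*)*"'       // double quoted chunk
             . '|\'[^\'\\\\]*(?:\\\\.[^\'\\\\]*)*\')/s'; // or single quoted chunk
    $parts = preg_split($pattern, $string, -1, PREG_SPLIT_DELIM_CAPTURE);
    $result = '';
    foreach ($parts as $i => $part) {
        // Even indexes hold unquoted text (possibly empty);
        // odd indexes hold the captured quoted chunks, kept unaltered.
        $result .= ($i % 2 === 0) ? str_replace($replace, $with, $part) : $part;
    }
    return $result;
}

echo replace_outside_quotes_split('?', '%s', '"a?"\'b?\' c=?'), "\n";
// "a?"'b?' c=%s
```

Unlike the script with a validation regex, this sketch does nothing special for malformed input: an unterminated quote is treated as unquoted text and will have replacements applied inside it.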


A no regex approach that I knocked up quickly. It works, but I am sure there are some optimisations that could be done.

function str_replace_outside_quotes($replace, $with, $string){
    $string = str_split($string);
    $accumulation = '';
    $current_unquoted_string = null;
    $inside_quote = false;
    $quotes = array("'", '"');
    foreach($string as $char) {
        if ($char == $inside_quote && "\\" != substr($accumulation, -1)) {
            $inside_quote = false;
        } else if(false === $inside_quote && in_array($char, $quotes)) {
            $inside_quote = $char;
        }

        if(false === $inside_quote) {
            $current_unquoted_string .= $char;
        } else {
            if(null !== $current_unquoted_string) {
                $accumulation .= str_replace($replace, $with, $current_unquoted_string);
                $current_unquoted_string = null;
            }
            $accumulation .= $char;
        }
    }
    if(null !== $current_unquoted_string) {
        $accumulation .= str_replace($replace, $with, $current_unquoted_string);
        $current_unquoted_string = null;
    }
    return $accumulation;
}

In my benchmarking it takes double the time of the regex approach above, and as the string length increases the regex option's resource use doesn't go up by much. The non-regex approach, on the other hand, scales linearly with the length of the text fed to it.

casperOne
Treffynnon
  • @ridgerunner I have attempted to merge your solution here. Perhaps you could give some insight as to why the regex is returning some empty matches? – Treffynnon Nov 13 '12 at 17:21
  • It has to do with the fact that `preg_split` returns more than one array element per match if there are more than one capture group. Instead of using `array_filter()` the `PREG_SPLIT_NO_EMPTY` flag can be added to the `preg_split` call. – ridgerunner Nov 14 '12 at 02:23
  • Also, if the string starts or ends with a quoted chunk, `preg_split` will return an empty array element from the empty end. This alternating algorithm (from the OP) works only when the target string has alternating quoted and unquoted parts and will fail when two quoted chunks are adjacent or if the string starts with a quoted chunk. (p.s. I'm not sure who downvoted this - I didn't - this does work given a properly formatted input string but is not very robust.) – ridgerunner Nov 14 '12 at 02:30
  • @ridgerunner Yeah someone came through and basically downvoted everything in this question last night some time. All of a sudden every answer and the question itself got a -1 without any kind of explanation. It is a disappointing aspect of elements of SO to be honest. – Treffynnon Nov 14 '12 at 21:25