Having a single-quoted string:

$content = '\tThis variable is not set by me.\nCannot do anything about it.\n';

I would like to interpret/process the string as if it were double-quoted. In other words, I would like to replace all the possible escape sequences (not just tab and linefeed as in this example) with their real values, taking into account that the backslash might be escaped as well, so '\\n' needs to be replaced by '\n'. eval() would easily do what I need, but I cannot use it.

Is there some simple solution?

(A similar thread that I found deals with expansion of variables in the single-quoted string while I'm after replacing escape characters.)

tmt
  • hakre, the concrete problem is described. The only solution that came to my mind was the use of eval() which is in my case unacceptable due to security reasons. – tmt Nov 29 '11 at 11:24
  • Why you don't put them into double quotes instead? – KingCrunch Nov 29 '11 at 11:40
  • @cascaval: Normally folks here expect that you add your code so that you show the problem you're running into while solving the issue. I guess you're just asking for code, I'll add an answer. – hakre Nov 29 '11 at 11:50
  • @KingCrunch: The string is a variable passed to my method in a CMS. I have no control over it. – tmt Nov 29 '11 at 12:02
  • @hakre: I haven't supplied any code as I was thinking that I might have missed some simple solution which actually doesn't consist of writing custom code that would parse the string and deal with all the possible replacements. – tmt Nov 29 '11 at 12:07
  • @cascaval: There is one function that comes close to it but is not identical, [I write about it in my answer](http://stackoverflow.com/a/8311920/367456), which shows an alternative to it as well. – hakre Nov 29 '11 at 14:36

3 Answers

There is a very simple way to do this, based on preg_replace_callback and stripcslashes, both built in:

preg_replace_callback(
    '/\\\\([nrtvf\\\\$"]|[0-7]{1,3}|x[0-9A-Fa-f]{1,2})/',
    fn($matches) => stripcslashes($matches[0]),
    $content
);

This works as long as "\\n" should become "\n" and the like.
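To make the behaviour concrete, here is a minimal sketch that wraps the call in a function and runs it on the string from the question. The function name unescapeDoubleQuoted is an assumption made up for illustration. The sketch spells the hex alternative as a plain x, on the assumption that PCRE would read \x without following hex digits as a NUL byte rather than a literal x.

```php
<?php
// Hypothetical helper name; it applies stripcslashes() only to valid escape sequences.
function unescapeDoubleQuoted(string $content): string
{
    return preg_replace_callback(
        // backslash followed by: a single escape character, 1-3 octal digits, or x + 1-2 hex digits
        '/\\\\([nrtvf\\\\$"]|[0-7]{1,3}|x[0-9A-Fa-f]{1,2})/',
        fn($matches) => stripcslashes($matches[0]),
        $content
    );
}

// Single-quoted literal: the backslashes below are literal characters.
$content = '\tThis variable is not set by me.\nCannot do anything about it.\n';

$result = unescapeDoubleQuoted($content);

// Compare against the same text written as a double-quoted literal.
var_dump($result === "\tThis variable is not set by me.\nCannot do anything about it.\n");
```

An escaped backslash is consumed as a pair, so the three characters \\n come out as the two characters \n rather than as a backslash plus a newline.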

If you're looking for processing these strings literally, see my previous answer.

Edit: You asked in a comment:

I'm just a bit puzzled what's the difference between the output of this and stripcslashes() directly [?]

The difference is not always visible, but there is one: stripcslashes removes the \ character even if no valid escape sequence follows. In PHP double-quoted strings, the backslash is not dropped in that case. For example, in "\d" the d is not a special character, so PHP preserves the backslash:

$content = '\d';
$content; # \d
stripcslashes($content); # d
preg_replace_callback(..., $content); # \d

That's why preg_replace_callback is useful here: it applies the function only to those substrings where stripcslashes works as intended, namely valid escape sequences.
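The comparison can be run as a small script. This is only a sketch: the helper name unescapeValidSequences is hypothetical, and the hex alternative is written as a plain x on the assumption that PCRE treats \x without hex digits as a NUL byte.

```php
<?php
// Hypothetical helper wrapping the regex + stripcslashes combination.
function unescapeValidSequences(string $content): string
{
    return preg_replace_callback(
        '/\\\\([nrtvf\\\\$"]|[0-7]{1,3}|x[0-9A-Fa-f]{1,2})/',
        static function (array $matches): string {
            return stripcslashes($matches[0]);
        },
        $content
    );
}

$content = '\d\n'; // four characters: backslash, d, backslash, n

var_dump(stripcslashes($content));          // the backslash before "d" is dropped
var_dump(unescapeValidSequences($content)); // the backslash before "d" is kept
```

In both cases the \n is turned into a newline; only the handling of the unknown sequence \d differs.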


After a couple of years the answer is updated for PHP 7.4+.

The original answer contained a demo that used the e (eval) modifier in the regex. For (mostly good) reasons, that modifier has been removed from PHP; it now refuses to work and emits an error like:

PHP Warning: preg_replace(): The /e modifier is no longer supported, use preg_replace_callback

In case the new version gives syntax errors (e.g. PHP < 7.4), or simply as a matter of preference, replace the arrow function with an anonymous one like:

static function (array $matches): string {
    return stripcslashes($matches[0]);
}

Please see Replace preg_replace() e modifier with preg_replace_callback for more on-site Q&A resources on replacing the e modifier in general. It was deprecated in PHP 5.5.0:

[The] e (PREG_REPLACE_EVAL) [...] was DEPRECATED in PHP 5.5.0 (Jun 2013), and REMOVED as of PHP 7.0.0 (Dec 2015).

from the PHP manual

hakre
  • Works great and is really simple. I'm just a bit puzzled what's the difference between the output of this and stripcslashes() directly. – tmt Nov 29 '11 at 17:35
  • Alright, reading the man for the `stripcslashes` didn't give me any clue that a backslash is removed when not followed by an escape sequence. Thank you very much for your valuable input and the time you've spent on it. I appreciate it! – tmt Nov 30 '11 at 10:52
  • The /e modifier has been deprecated, you can still use this very same answer in the following way: ` preg_replace_callback( '#\\\\([nrtvf\\\\$"]|[0-7]{1,3}|\x[0-9A-Fa-f]{1,2})#', static function($value){ return stripcslashes($value[0]); }, $value ) ` – ln -s Mar 01 '21 at 17:35
  • @兜甲児: Yes, this answer was quite old, thanks for the hint. I updated it and left some more references about the error it gave for nowadays. Found the [arrow functions](https://www.php.net/manual/en/functions.arrow.php) particularly nice in this context. – hakre Mar 03 '21 at 07:34

If you need to process the exact escape sequences like PHP does, you need the long version, which is the DoubleQuoted class. I extended the input string a bit to cover more escape sequences than in your question, to make this more generic:

$content = '\\\\t\tThis variable\\string is\x20not\40set by me.\nCannot \do anything about it.\n';

$dq = new DoubleQuoted($content);

echo $dq;

Output:

\\t This variable\string is not set by me.
Cannot \do anything about it.

However, if a close approximation is good enough, there is a PHP function called stripcslashes. For comparison, I've added its result next to that of the PHP double-quoted string:

echo stripcslashes($content), "\n";

$compare = "\\\\t\tThis variable\\string is\x20not\40set by me.\nCannot \do anything about it.\n";

echo $compare, "\n";

Output:

\t  This variablestring is not set by me.
Cannot do anything about it.

\\t This variable\string is not set by me.
Cannot \do anything about it.

As you can see stripcslashes drops some characters here compared to PHP native output.

(Edit: See my other answer as well, which offers something simple and sweet with stripcslashes and preg_replace_callback.)

If stripcslashes is not suitable, there is DoubleQuoted. Its constructor takes a string that is treated like a double-quoted string (minus variable substitution; only the character escape sequences are handled).

As the manual outlines, there are multiple escape sequences. They look like regular expressions, and all start with \, so it seems natural to use regular expressions to replace them.

However there is one exception: \\ will skip the escape sequence. The regular expression would need to have backtracking and/or atomic groups to deal with that and I'm not fluent with those so I just did a simple trick: I only applied the regular expressions to those parts of the string which do not contain \\ by simply exploding the string first and then imploding it again.
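The explode/implode trick can be tried in isolation. The following is a minimal sketch with a made-up input and only one escape sequence (\n) handled; it is not the class itself.

```php
<?php
// Made-up input. The variable holds: \\n and \n
// (an escaped backslash followed by "n", then a real \n escape sequence)
$input = '\\\\n and \n';

// Split on the two-character sequence \\ so that no part contains it any more.
$parts = explode('\\\\', $input);

// preg_replace() accepts an array subject and processes every element.
$parts = preg_replace('/\\\\n/', "\n", $parts);

// Glue the parts back together with the same two-character sequence.
$output = implode('\\\\', $parts);

var_dump($output);
```

Only the second \n becomes a newline. The n that follows the escaped backslash is left alone, because after the split it is no longer preceded by a backslash.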

The two regular-expression-based replace functions, preg_replace and preg_replace_callback, can operate on arrays as well, so this is quite easy to do.

It's done in the __toString() function:

class DoubleQuoted
{
    ...
    private $string;
    public function __construct($string)
    {
        $this->string = $string;
    }
    ...
    public function __toString()
    {
        $this->exception = NULL;
        $patterns = $this->getPatterns();
        $callback = $this->getCallback();
        $parts = explode('\\\\', $this->string);
        try
        {
            $parts = preg_replace_callback($patterns, $callback, $parts);
        }
        catch(Exception $e)
        {
            $this->exception = $e;
            return FALSE; # provoke exception
        }
        return implode('\\\\', $parts);
    }
    ...

See the explode and implode calls. They ensure that preg_replace_callback never operates on a string that contains \\, so the replace operation is freed from the burden of dealing with that special case. Below is the callback function that preg_replace_callback invokes for each pattern match. I wrapped it in a closure so that it is not publicly accessible:

private function getCallback()
{   
    $map = $this->map;
    return function($matches) use ($map)
    {
        list($full, $type, $number) = $matches += array('', NULL, NULL);

        if (NULL === $type)
            throw new UnexpectedValueException(sprintf('Match was %s', $full))
            ;

        if (NULL === $number)
            return isset($map[$type]) ? $map[$type] : '\\'.$type
            ;

        switch($type)
        {
            case 'x': return chr(hexdec($number));
            case '': return chr(octdec($number));
            default:
                throw  new UnexpectedValueException(sprintf('Match was %s', $full));
        }   
    };
}

You need some additional information to understand it, as this is not the complete class yet. I'll go through the missing points and add the missing code as well:

All patterns the class "looks for" contain subgroups, at least one. That one goes into $type and is either the single character to be translated or an empty string for octals and an x for hexadecimal numbers.

The optional second group $number is either not set (NULL) or contains the octal/hexadecimal number. The $matches input is normalized to the just named variables in this line:

list($full, $type, $number) = $matches += array('', NULL, NULL);
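The normalization relies on the array union operator, which only adds keys that are missing on the left-hand side. A small sketch with hand-written match arrays (hypothetical values, shaped like what the callback receives) shows the effect; it uses + rather than += so the sample arrays stay unmodified.

```php
<?php
// Hypothetical match arrays, shaped like what preg_replace_callback passes in.
$single = array('\n', 'n');         // full match, type (no number group)
$hex    = array('\x20', 'x', '20'); // full match, type, number

// Key 2 is missing in $single, so the union fills it with NULL.
list($full, $type, $number) = $single + array('', NULL, NULL);
var_dump($full, $type, $number);

// All three keys exist in $hex, so the defaults are ignored.
list($hexFull, $hexType, $hexNumber) = $hex + array('', NULL, NULL);
var_dump($hexFull, $hexType, $hexNumber);
```

This is why the callback can test NULL === $number to tell the single-character escapes apart from the octal and hexadecimal ones.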

Patterns are defined upfront as sequences in a private member variable:

private $sequences = array(
    '(n|r|t|v|f|\\$|")', # single escape characters
    '()([0-7]{1,3})', # octal
    '(x)([0-9A-Fa-f]{1,2})', # hex
);

The getPatterns() function just wraps those definitions into valid PCRE regular expressions like:

/\\(n|r|t|v|f|\$|")/ # single escape characters
/\\()([0-7]{1,3})/ # octal
/\\(x)([0-9A-Fa-f]{1,2})/ # hex

It is pretty simple:

private function getPatterns()
{
    foreach($this->sequences as $sequence)
        $patterns[] = sprintf('/\\\\%s/', $sequence)
        ;

    return $patterns;
}

Now that the patterns are outlined, this explains what $matches contains when the callback function is invoked.
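To see the shape of $matches for each pattern, the three patterns can be run against one sample each with preg_match. This is a sketch with made-up sample inputs; the patterns are written out by hand the way getPatterns() would build them.

```php
<?php
// The three patterns, written out as getPatterns() would build them.
$patterns = array(
    '/\\\\(n|r|t|v|f|\\$|")/',     // single escape characters
    '/\\\\()([0-7]{1,3})/',        // octal
    '/\\\\(x)([0-9A-Fa-f]{1,2})/', // hex
);

// Made-up samples, one per pattern (single-quoted, so the backslash is literal).
$samples = array('\n', '\40', '\x20');

$captured = array();
foreach ($samples as $i => $sample) {
    preg_match($patterns[$i], $sample, $matches);
    $captured[] = $matches;
    print_r($matches);
}
```

The first pattern yields two elements (full match and type), while the octal and hex patterns yield three, with an empty string or x as the type.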

The other thing you need to know to understand how the callback works is $map. That's just an array containing the single replacement characters:

private $map = array(
    'n' => "\n",
    'r' => "\r",
    't' => "\t",
    'v' => "\v",
    'f' => "\f",
    '$' => '$',
    '"' => '"',
);

And that's pretty much it for the class. There is one more private variable, $this->exception, which stores any exception thrown during replacement. At the time of writing (before PHP 7.4), __toString() could not throw exceptions, and an exception escaping from the callback would have caused a fatal error. So it is caught and stored in a private member instead; here is that part of the code again:

    ...
    public function __toString()
    {
        $this->exception = NULL;
        ...
        try
        {
            $parts = preg_replace_callback($patterns, $callback, $parts);
        }
        catch(Exception $e)
        {
            $this->exception = $e;
            return FALSE; # provoke exception
        }
        ...

In case of an exception while replacing, the function exits with FALSE, which will lead to a catchable exception. A getter function then makes the internal exception available:

private $exception;
...
public function getException()
{
    return $this->exception;
}

As it's nice to access the original string as well, you can add another getter to obtain that:

public function getString()
{
    return $this->string;
}

And that's the whole class. Hope this is helpful.

hakre
  • You have really gone to great length with your answer. Thank you very much! While stripcslashes() apparently does what I needed, I'll sure play with the code provided as well. – tmt Nov 29 '11 at 16:56
  • Non-literal processing is much shorter, I've added [another answer](http://stackoverflow.com/a/8314506/367456). – hakre Nov 29 '11 at 16:56

A regex-based solution would probably be most maintainable here (the definitions of valid escape sequences in strings are even provided as regexes in the documentation):

$content = '\tThis variable is not set by me.\nCannot do anything about it.\n';

$replaced = preg_replace_callback(
                '/\\\\(\\\\|n|r|t|v|f|"|[0-7]{1,3}|x[0-9A-Fa-f]{1,2})/',
                'replacer',
                $content);

var_dump($replaced);

function replacer($match) {
    $map = array(
        '\\\\' => "\\",
        '\\n' => "\n",
        '\\r' => "\r",
        '\\t' => "\t",
        '\\v' => "\v",
        '\\f' => "\f",
        '\\"' => '"',
        // add '\\$' => '$' if the pattern is extended to match \$
    );

    $match = $match[0]; // So that $match is a scalar, the full matched pattern

    if (!empty($map[$match])) {
        return $map[$match];
    }

    // Otherwise it's octal or hex notation
    if ($match[1] == 'x') {
        return chr(hexdec(substr($match, 2)));
    }
    else {
        return chr(octdec(substr($match, 1)));
    }
}

The above can also (and really should) be improved:

  • Package the replacer function as an anonymous function instead
  • Possibly replace $map with a switch for a free performance increase
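Both suggestions could look roughly like the following. This is a sketch, not a drop-in replacement: the variable name $unescape is made up, the hex alternative is written as a plain x (assuming PCRE reads \x without hex digits as a NUL byte), and \f and \" are handled explicitly.

```php
<?php
// Hypothetical closure combining the pattern with a switch-based replacer.
$unescape = static function (string $content): string {
    return preg_replace_callback(
        '/\\\\(\\\\|n|r|t|v|f|"|[0-7]{1,3}|x[0-9A-Fa-f]{1,2})/',
        static function (array $match): string {
            $body = $match[1]; // everything after the leading backslash
            switch ($body) {
                case '\\': return '\\';
                case 'n':  return "\n";
                case 'r':  return "\r";
                case 't':  return "\t";
                case 'v':  return "\v";
                case 'f':  return "\f";
                case '"':  return '"';
            }
            // Otherwise it is hex (x + digits) or octal (digits only).
            if ($body[0] === 'x') {
                return chr(hexdec(substr($body, 1)));
            }
            return chr(octdec($body));
        },
        $content
    );
};

var_dump($unescape('\t\x41\101\\\\n'));
```

Because \\ is the first alternative in the pattern, an escaped backslash is consumed as a pair before the following character can be mistaken for the start of an escape sequence.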
Jon
  • This does not work for `\\` (e.g. at the beginning of the string, like `\\t`). Additionally, non-escape characters should not be escaped instead of octdec'ing them. – hakre Nov 29 '11 at 12:23
  • @hakre: I 'm not sure what you mean does not work for `\\t`. It does, but you have to specify `\\\\t` inside the string literal to get `\\t` inside the string (in which case you get `\t` as output). As for the "default `octdec`" bug, on second thought it doesn't exist because the regex won't match such sequences in the first place (removed invalid assumption from answer). – Jon Nov 29 '11 at 12:58
  • `$t = "\\t";` will result in `\t` in PHP. That's [not covered with your replaces](http://codepad.org/N8gjoPPF). I wondered which regex could do that with `preg_replace` but I'm not a regex guru so [I divided the problem](http://stackoverflow.com/a/8311920/367456) instead. – hakre Nov 29 '11 at 13:58
  • @hakre: If **you write** `$content = '\t'` then the variable contains `\t` (two characters). If you write `$content = '\\t'` then *again* the variable contains `\t` (two characters), which is not a problem in my code (in both cases, it will return a string with one tab character) but a result of how literals are parsed. I know that you know all this, but state it again because I think we simply have interpreted the question differently. – Jon Nov 29 '11 at 14:12
  • please compare against the double quoted original: http://codepad.org/3RwxZP3U or: http://codepad.org/WbLjnLls or even more clear: http://codepad.org/H2NE4cme – hakre Nov 29 '11 at 14:25
  • @hakre: That's what I 'm trying to say: I am assuming `$content` contains what would go between double quotes without it being subject to parsing at all (*"this variable is not set by me"* -- does not sound to me like it's being set to string *literal*). All your examples subject a *literal* to single-quote parsing rules, which invalidates the assumption and therefore produces "wrong" results. I don't know how else to put it. – Jon Nov 29 '11 at 15:53
  • @Jon: It seems that your code does exactly what I've been looking for. However, apparently so does stripcslashes(). I really appreciate your input. – tmt Nov 29 '11 at 16:47
  • @cascaval Stripcslashes doesn't exactly do it, however, I've added [another answer](http://stackoverflow.com/a/8314506/367456) which I think does what you're looking for. It's a combination of regex + stripcslashes. – hakre Nov 29 '11 at 16:57