
I looked through related questions before posting this and I couldn't modify any relevant answers to work with my method (not good at regex).

Basically, here are my existing lines:

$code = preg_replace_callback( '/"(.*?)"/', array( &$this, '_getPHPString' ), $code );

$code = preg_replace_callback( "#'(.*?)'#", array( &$this, '_getPHPString' ), $code );

They match strings enclosed in '' and "" respectively. I need each regex to skip over escaped quotes of its own kind inside the string, so data between '' ignores \' and data between "" ignores \".

Any help would be greatly appreciated.

Jonathan Leffler
Nahydrin
  • Do you need to be able to handle escaped slashes as well? In other words should it assume that any quote preceded by a slash is escaped, even if that slash is itself preceded by a slash? –  Apr 17 '11 at 17:57
  • @Phoenix, if you are referring to `\\"` and `\\'`, then no I do not. – Nahydrin Apr 17 '11 at 18:00
  • If you don't handle escaping the escape character, then escaping a particular character is invalid. –  Apr 17 '11 at 18:17

6 Answers


For most strings, you need to allow any escaped character, not just escaped quotes. For example, you most likely need to allow escapes like "\n" and "\t" and, of course, the escaped escape: "\\".

This is a frequently asked question, and one which was solved (and optimized) long ago. Jeffrey Friedl covers this question in depth (as an example) in his classic work: Mastering Regular Expressions (3rd Edition). Here is the regex you are looking for:

Good:

"([^"\\]|\\.)*"
Version 1: Works correctly but is not terribly efficient.

Better:

"([^"\\]++|\\.)*" or "((?>[^"\\]+)|\\.)*"
Version 2: More efficient if you have possessive quantifiers or atomic groups (See: sin's correct answer which uses the atomic group method).

Best:

"[^"\\]*(?:\\.[^"\\]*)*"
Version 3: More efficient still. Implements Friedl's "unrolling-the-loop" technique. It does not require possessive quantifiers or atomic groups, so it can be used in JavaScript and other less-featured regex engines.

Here are the recommended regexes in PHP syntax for both double and single quoted sub-strings:

$re_dq = '/"[^"\\\\]*(?:\\\\.[^"\\\\]*)*"/s';
$re_sq = "/'[^'\\\\]*(?:\\\\.[^'\\\\]*)*'/s";
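As a minimal usage sketch (the sample input in `$code` is made up for illustration and is not from the original post), the two patterns can be run with `preg_match_all` to check that escaped quotes and escaped backslashes stay inside the match:

```php
<?php
// Patterns copied from the answer above.
$re_dq = '/"[^"\\\\]*(?:\\\\.[^"\\\\]*)*"/s';
$re_sq = "/'[^'\\\\]*(?:\\\\.[^'\\\\]*)*'/s";

// Hypothetical sample input, written as a nowdoc so every backslash is literal.
$code = <<<'EOT'
echo "say \"hi\"" . 'it\'s' . "tab\\";
EOT;

preg_match_all($re_dq, $code, $dq);
preg_match_all($re_sq, $code, $sq);

print_r($dq[0]); // two matches: "say \"hi\"" and "tab\\"
print_r($sq[0]); // one match:   'it\'s'
```

Passing the same patterns to the `preg_replace_callback` calls from the question should behave the same way, since only the pattern changes.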
ridgerunner
  • +1 for this `"[^"\\]*(?:\\.[^"\\]*)*"` avoiding alternation and benching better than `"(\\.|[^"\\]+)*"` –  Apr 17 '11 at 21:51
  • If I want to include a "possible" @ sign in front of the double quotes (C#), how would I do that? I tried using group and class, but to no avail. – Nahydrin Apr 17 '11 at 22:27
  • @Brian Graham - To add an optional @ in front of the expression, just add an `@?` in front of the leading quote. However, it is not as simple as that. With C# `@"..."` strings, an embedded quote is NOT `\"` (escaped with a backslash) but is rather `""` (two quotes in a row). In this case the expression you want is: `@"[^"]*(""[^"]*)*"`. – ridgerunner Nov 05 '12 at 19:35
  • +1 for excellent overview of options to solve this common problem. :) – zx81 May 18 '14 at 03:25
  • @Brian Graham - In my prior comment I suggested that the regex to match a `@"..."` C# string is: `@"[^"]*(""[^"]*)*"` and this is correct. However, to encode this regex in C# the regex string would be written: `@"@""[^""]*(""""[^""]*)*"""`. – ridgerunner Jul 23 '15 at 23:36
  • Was wondering whether to create a new question or not... decided to write as comment. If I want to match either double, or single quotes, I can do it like this quite easily: `/'([^'\\]*(?:\\.[^'\\]*)*)'|"([^"\\]*(?:\\.[^"\\]*)*)"/`. However I wonder if there's a nice trick to make this shorter, perhaps by remembering which quote was the first quote to match somehow? Edit: in JavaScript <3 – Stephan Bijzitter Dec 08 '15 at 11:06
  • Option three (with an added inner capture group) in action: [`"([^"\\]*(?:\\.[^"\\]*)*)"`](https://regex101.com/r/uHyGQ7/1) – wp78de Oct 21 '19 at 15:16
  • If I read Friedl correctly, he implies that your option marked "Better" would be one of the best approaches if possessive quantifiers are supported, but he seems to indicate that he prefers using _two_ possessive quantifiers: `"([^"\\]++|\\.)*+"`. Am I interpreting him correctly, and what are your thoughts on this? – Garret Wilson Jan 20 '20 at 04:35
  • How could you modify this to split by spaces EXCEPT things that are between non-escaped quotes? This is what I got (works): `(?=\S)[^"\s]*(?:"[^\\"]*(?:\\[\s\S][^\\"]*)*"[^"\s]*)*` – DGoiko Dec 23 '20 at 22:03

Try a regex like this:

'/"(\\\\[\\\\"]|[^\\\\"])*"/'

A (short) explanation:

"                 # match a `"`
(                 # open group 1
  \\\\[\\\\"]     #   match either `\\` or `\"`
  |               #   OR
  [^\\\\"]        #   match any char other than `\` and `"`
)*                # close group 1, and repeat it zero or more times
"                 # match a `"`

The following snippet:

<?php
$text = 'abc "string \\\\ \\" literal" def';
preg_match_all('/"(\\\\[\\\\"]|[^\\\\"])*"/', $text, $matches);
echo $text . "\n";
print_r($matches);
?>

produces:

abc "string \\ \" literal" def
Array
(
    [0] => Array
        (
            [0] => "string \\ \" literal"
        )

    [1] => Array
        (
            [0] => l
        )

)

as you can see on Ideone.

Bart Kiers
  • I took your example and can't seem to get it working. Direct copy paste doesn't work, I also tried editing it to no avail. – Nahydrin Apr 17 '11 at 18:18
  • @Dark Slipstream, copy-pasting the snippet (without altering it!) did not work? I find that hard to believe. What PHP version are you using? Have you tried the Ideone link? – Bart Kiers Apr 17 '11 at 18:20
  • Thanks for your help Bart, I managed to get it working after a tweak to the beginning of the string. – Nahydrin Apr 17 '11 at 18:32
  • Does not match: `"String with a linefeed\n"` – ridgerunner Apr 17 '11 at 20:25

This has possibilities:

/"(?>(?:(?>[^"\\]+)|\\.)*)"/

/'(?>(?:(?>[^'\\]+)|\\.)*)'/
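A rough PHP sketch of the same two patterns follows. The variable names and sample text are hypothetical, and each regex backslash is doubled so that PHP passes it through to PCRE:

```php
<?php
// The atomic-group patterns from above as PHP string literals.
$re_dq_atomic = '/"(?>(?:(?>[^"\\\\]+)|\\\\.)*)"/';
$re_sq_atomic = "/'(?>(?:(?>[^'\\\\]+)|\\\\.)*)'/";

// Hypothetical sample text; the nowdoc keeps backslashes literal.
$text = <<<'EOT'
print "a \"b\" c" and 'd \'e\' f'
EOT;

preg_match($re_dq_atomic, $text, $m_dq);
preg_match($re_sq_atomic, $text, $m_sq);
echo $m_dq[0], "\n"; // "a \"b\" c"
echo $m_sq[0], "\n"; // 'd \'e\' f'

// An unterminated string does not match: the atomic group refuses to give
// back what it consumed, so the attempt fails without retrying.
$unterminated = preg_match($re_dq_atomic, '"never closed'); // 0
```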

  • It works perfectly with Perl `my ($new_str) = ($str =~ /'((?>(?:(?>[^'\\]+)|\\.)*))'/);` - eg. `'#10's footer'` will give `#10`, `'#10\'s footer'` give `#10\'s footer`, `'10\\\'s footer'` give `10\\\'s footer` etc. – Diblo Dk Sep 10 '19 at 10:56

This leaves the quotes outside the match:

(?<=['"])(.*?)(?=["'])

With the global flag (/g), it matches all occurrences.

danielpopa

This seems to be as fast as the unrolled loop, based on some cursory benchmarks, but is much easier to read and understand. It doesn't require any backtracking in the first place.

"[^"\\]*(\\.[^"\\]*)*"
Andrew Traviss
  • This **is** the unrolled loop. It's exactly the same as the third regex in [ridgerunner's answer](http://stackoverflow.com/a/5696141/20938), except you used a capturing group (making it slightly less efficient). – Alan Moore Sep 07 '13 at 12:10
  • Hmm, I did benchmark it before posting and I wasn't able to produce a consistent difference in speed between the two. – Andrew Traviss Dec 03 '13 at 16:37
  • I somehow didn't notice it was basically the same as ridgerunner's third regex, though. My mistake. – Andrew Traviss Dec 03 '13 at 16:38
  • Looking back, I shouldn't have mentioned efficiency. The difference in performance between capturing and non-capturing groups is so tiny, it will almost never have a significant effect on overall performance. It certainly won't matter for regexes as simple as this one. – Alan Moore Dec 03 '13 at 17:52

According to the W3C resource: https://www.w3.org/TR/2010/REC-xpath20-20101214/#doc-xpath-StringLiteral

The general regex is:

"(\\.|[^"])*"

(There is no need to exclude backslashes in the character class when the escape alternative is tried first, at least for well-formed input.)

Explanation:

  • "..." matches anything between quotes
  • (...)* lets the inside have any length, from zero characters upward
  • \\.|[^"] first accepts any character preceded by a backslash; otherwise (|) it accepts any character that is not a quote
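One edge case may be worth checking, offered here as a sketch rather than a verdict. On malformed input whose only closing quote is escaped, backtracking can let `[^"]` consume the backslash on its own, so this pattern can still report a match where a pattern that excludes backslashes from the class does not. The variable names and the sample input are hypothetical, and the stricter pattern is the unrolled-loop regex from an earlier answer:

```php
<?php
$loose  = '/"(\\\\.|[^"])*"/';                 // class does not exclude backslash
$strict = '/"[^"\\\\]*(?:\\\\.[^"\\\\]*)*"/';   // unrolled loop, backslash excluded

// Hypothetical malformed input: "abc\" def  (the string is never closed)
$subject = '"abc\\" def';

$nLoose  = preg_match($loose,  $subject, $mLoose);
$nStrict = preg_match($strict, $subject, $mStrict);

var_dump($nLoose);   // int(1), matched text is "abc\"
var_dump($nStrict);  // int(0)
```

Whether this matters depends on whether the input is guaranteed to be well formed.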

A PHP version of the regex, with extra grouping to handle either kind of quote, could look like this:

<?php
    $str='"First \\" \n Second" then \'This \\\' That\'';
    echo $str."\n";
    // "First \" \n Second" then 'This \' That'

    $RX_inQuotes='/"((\\\\.|[^"])*)"/';
    preg_match_all($RX_inQuotes,$str,$r,PREG_SET_ORDER);
    echo $r[0][1]."\n";
    // First \" \n Second

    $RX_inAnyQuotes='/("((\\\\.|[^"])*)")|(\'((\\\\.|[^\'])*)\')/';
    preg_match_all($RX_inAnyQuotes,$str,$r,PREG_SET_ORDER);
    echo $r[0][2]." --- ".$r[1][5];
    // First \" \n Second --- This \' That
?>

Try it: http://sandbox.onlinephpfunctions.com/code/4328cc4dfc09183f7f1209c08ca5349bef9eb5b4

Important note: when the content may contain multi-byte characters, add the u flag at the end of the regex (/.../u) so that multi-byte strings such as UTF-8 are not corrupted, or use functions like mb_ereg_match.

MMMahdy-PAPION