I have some text of the form:

This is some text, and here's some in "double quotes"
"and here's a double quote:\" and some more", "text that follows"

The text contains strings within double quotes, as can be seen above. A double quote may be escaped with a backslash (\). In the above, there are three such strings:

"double quotes"
"and here's a double quote:\" and some more"
"text that follows"

To extract these strings, I tried the regex:

"(?:\\"|.)*?"

However, this gives me the following results:

>>> preg_match_all('%"(?:\\"|.)*?"%', $msg, $matches)
>>> $matches
[
  [ "double quotes",
    "and here's a double quote:\",
    ", "
  ]
]

How can I correctly obtain the strings?

  • You're almost there; it's just an escaping issue. To match a literal backslash, you have to write it like this: `'%"(?:\\\\"|.)*?"%'`. – revo Mar 18 '18 at 09:26

3 Answers

One way to do it is with a negative lookbehind:

".*?(?<!\\)"


Which in PHP would be:
<?php

$text = <<<TEXT
This is some text, and here's some in "double quotes"
"and here's a double quote:\" and some more", "text that follows"
TEXT;

$regex = '~".*?(?<!\\\\)"~';

if (preg_match_all($regex, $text, $matches)) {
    print_r($matches);
}
?>


This yields
Array
(
    [0] => Array
        (
            [0] => "double quotes"
            [1] => "and here's a double quote:\" and some more"
            [2] => "text that follows"
        )

)


See a demo on regex101.com.
To let a match span multiple lines, enable dotall mode via:
"(?s:.*?)(?<!\\)"

See a demo for the latter on regex101.com as well.

Jan
  • Testing if a quote is preceded by a backslash doesn't prove anything. You don't know whether the backslash is itself escaped by another backslash (in other words, you don't know whether the number of backslashes before the quote is odd or even). – Casimir et Hippolyte Mar 17 '18 at 22:02
  • @CasimiretHippolyte: While this is true there was no requirement to do so in OP's question. – Jan Mar 18 '18 at 13:41
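
The limitation raised in the comments can be reproduced with a short sketch. The input below is a hypothetical example, not taken from the question: its first string ends in an escaped backslash, so the quote right after it really does close the string.

```php
<?php
// Hypothetical input (not from the question): "a\\" "b"
// The first string ends in an escaped backslash, so its closing quote is real.
$text = '"a\\\\" "b"';

// The lookbehind pattern from the answer above.
$regex = '~".*?(?<!\\\\)"~';

preg_match_all($regex, $text, $matches);
print_r($matches[0]);
```

On this input the pattern returns a single match, `"a\\" "`, rather than the two strings `"a\\"` and `"b"`, because the lookbehind only sees that the closing quote is preceded by a backslash.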

If you let the regex match backslashes as ordinary characters, the match will end at the " of \" (because the preceding \ is consumed as just another character). So you need to allow \" to be matched as a unit, but not \ or " individually. The result is the following regex:

"((?:[^"\\]*(?:\\")*)*)"

Try it here!

Explained in detail below:

"                begin with a double-quote character
(                capture only what follows (within " characters)
  (?:            group without creating a separate capture group
    [^"\\]*      match any non-" non-\ characters, any number of times
    (?:\\")*     match any \" escape sequences, any number of times
  )*             allow the previous two parts to occur any number of times, in any order
)                end the capture group
"                make sure it ends with a "

Note that, in many languages, when feeding a regex string to a method to parse some text, you'll need to escape the backslash characters, quotes, etc. In PHP, the above would become:

'/"((?:[^"\\\\]*(?:\\\\")*)*)"/'
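
As a minimal sketch of how that pattern might be used, the snippet below runs it against the sample text from the question. The nowdoc (`<<<'TEXT'`) and the variable names are choices made for this illustration; a nowdoc is used so that every backslash in the sample stays literal.

```php
<?php
// Sample text from the question; a nowdoc keeps every backslash literal.
$text = <<<'TEXT'
This is some text, and here's some in "double quotes"
"and here's a double quote:\" and some more", "text that follows"
TEXT;

// The pattern above, with each regex backslash doubled for the PHP string.
$regex = '/"((?:[^"\\\\]*(?:\\\\")*)*)"/';

preg_match_all($regex, $text, $matches);

print_r($matches[0]); // full matches, including the surrounding quotes
print_r($matches[1]); // capture group 1: the contents without the quotes
```

Both arrays should hold three entries; `$matches[1]` is the one to read if the surrounding quotes are not wanted.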
awwsmm
  • Unfortunately, this gives me a `missing terminating ] for character class`: https://repl.it/repls/RecklessConstantLesson –  Mar 17 '18 at 20:14
  • @user2064000 Using PHP, you have to escape backslashes: `'/"(?:(?:[^"\\\\])*(?:\\\\")*)*"/'` – Syscall Mar 17 '18 at 20:17
  • @Syscall, yeah, got them mixed up with bash syntax. –  Mar 17 '18 at 20:19
  • What's with all of the unnecessary non capturing groups? Groups cost steps. – mickmackusa Mar 17 '18 at 21:38
  • That's more or less correct (because it doesn't handle escaped characters that aren't quotes), but please remove all of these useless groups. – Casimir et Hippolyte Mar 17 '18 at 21:38
  • https://regex101.com/r/uB6lqJ/1 Removing groups takes your step count from 208 to 49 on the sample input string. – mickmackusa Mar 17 '18 at 21:39
  • Thanks all, updated the solution to remove the extra capture groups and added a note about escaping backslashes. @user2064000, if this answers your question, please accept it as the correct answer. – awwsmm Mar 17 '18 at 23:12

If you echo your pattern, you'll see it's indeed passed as %"(?:\"|.)*?"% to the regex parser. The single backslash will be treated as an escape character even by the regex parser.

So, if the pattern is inside single quotes, you need to add at least one more backslash so that two backslashes reach the regex parser (one to escape the other). The pattern the parser sees then becomes: %"(?:\\"|.)*?"%

preg_match_all('%"(?:\\\"|.)*?"%', $msg, $matches);
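
The effect of single-quoted string escaping can be checked directly by echoing both literals. This is only a sketch: the variable names `$asWritten` and `$fixed` are made up for the illustration, and the sample text is copied from the question into a nowdoc so its backslash stays literal.

```php
<?php
// In a single-quoted PHP string, \\ collapses to one backslash,
// while \" is not an escape sequence and is kept as-is.
$asWritten = '%"(?:\\"|.)*?"%';   // the pattern from the question
$fixed     = '%"(?:\\\"|.)*?"%';  // the pattern with one more backslash

echo $asWritten, PHP_EOL; // %"(?:\"|.)*?"%
echo $fixed, PHP_EOL;     // %"(?:\\"|.)*?"%

$msg = <<<'TEXT'
This is some text, and here's some in "double quotes"
"and here's a double quote:\" and some more", "text that follows"
TEXT;

preg_match_all($fixed, $msg, $matches);
print_r($matches[0]);
```

With the extra backslash, the three quoted strings from the question come back as three separate matches.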

Still, this isn't a very efficient pattern. The question actually seems to be a duplicate of this one.

There is a better pattern available in this answer (what some would call unrolled).

preg_match_all('%"[^"\\\]*(?:\\\.[^"\\\]*)*"%', $msg, $matches);

See demo at eval.in or compare steps with other patterns in regex101.
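
A small sketch of the unrolled pattern in use follows. The second input, `$tricky`, is a hypothetical case that is not part of the question: a string ending in an escaped backslash, added to show that the pattern consumes escape sequences as pairs.

```php
<?php
// The unrolled pattern, written exactly as in the call above.
$pattern = '%"[^"\\\]*(?:\\\.[^"\\\]*)*"%';

// Sample text from the question; a nowdoc keeps the backslash literal.
$msg = <<<'TEXT'
This is some text, and here's some in "double quotes"
"and here's a double quote:\" and some more", "text that follows"
TEXT;

preg_match_all($pattern, $msg, $matches);
print_r($matches[0]);

// Hypothetical extra input (not from the question): "a\\" "b"
$tricky = '"a\\\\" "b"';

preg_match_all($pattern, $tricky, $trickyMatches);
print_r($trickyMatches[0]);
```

The sample text yields the three expected strings, and the hypothetical input yields `"a\\"` and `"b"` as two separate matches, since each backslash is matched together with the character that follows it.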

bobble bubble