I'm a little ashamed to admit it, but some aspects of regular expressions are still unclear to me.
I need to parse a text file that contains a number of string literals in the format @"I'm a string".
I've composed a simple pattern: /@"([^"]*)"/si. It works perfectly, and preg_match_all returns a collection. But obviously it doesn't work properly if a string literal contains escaped quotes, like @"I'm plain string. I'm \"qouted\" string ".
I would appreciate any clue.

-
I actually tried your escaped string and the pattern seems to work – aleation Mar 19 '13 at 12:15
-
yes, the pattern works, but value grabbed by placeholder ([^"]*) is not as expected – heximal Mar 19 '13 at 12:18
-
`preg_match_all('/@"(.*)"$/si', $text, $match);` .? – MatRt Mar 19 '13 at 12:20
-
You're right, but using this it works for me: /@"(.*)"/si – aleation Mar 19 '13 at 12:22
-
Seems to be ok : `preg_match_all('/@"([^"]|\\")*"/si', $text, $match);`. Try it here : http://sandbox.onlinephpfunctions.com/code/d62a5e00484640badbb8f48ece0c98870ab66b49 – MatRt Mar 19 '13 at 12:25
-
see http://stackoverflow.com/q/6243778/592540 – Carlos Campderrós Mar 19 '13 at 12:29
-
Just have to adapt a little... `preg_match_all('/(@"([^"]*|(\\"))")/si', $text, $match);` is working on your new string example : http://sandbox.onlinephpfunctions.com/code/59eb5bc6e0ad36ec8919d356e805f73b21ef084a – MatRt Mar 19 '13 at 12:36
2 Answers
This is a use case for Friedl's classic "unrolled loop": (EDIT fixed grouping for capture)
/"((?:[^"\\]|\\.)*)"/
This will match the quoted string, taking backslash-escaped quotes into account.
The full regex you would use to match a field (including the @) would be:
/@"((?:[^"\\]|\\.)*)"/
But be careful! I often see people complaining that this pattern doesn't work in PHP, and this is because of the slightly mind-melting nature of using a backslash in a PHP string.
The backslashes in the above pattern represent a literal backslash that needs to be passed into PCRE. This means that they need to be double-escaped when using them in a PHP string:
$expr = '/@"((?:[^"\\\\]|\\\\.)*)"/';
preg_match_all($expr, $subject, $matches);
print_r($matches[1]); // this will show the content of all the matched fields
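As a quick check, here is a minimal sketch that runs both the pattern from the question and the pattern above against text modelled on the question's examples. The variable names and the sample text are only for illustration.

```php
<?php
// Sample input modelled on the question; a nowdoc keeps the backslashes literal.
$subject = <<<'TXT'
@"I'm a string"
@"I'm plain string. I'm \"qouted\" string "
TXT;

// Pattern from the question: the capture stops at the first quote, escaped or not.
preg_match_all('/@"([^"]*)"/', $subject, $naive);

// Pattern from this answer, with every regex backslash doubled for the PHP string.
preg_match_all('/@"((?:[^"\\\\]|\\\\.)*)"/', $subject, $escaped);

print_r($naive[1]);   // second entry is cut off just before the first escaped quote
print_r($escaped[1]); // second entry holds the whole literal, backslashes included
```

The first pattern truncates the second literal at the escaped quote, while the second pattern captures it in full.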
How does it work?
...I hear you ask. Well, let's see if I can explain this in a way that actually makes sense. Let's enable x mode so we can space it out a bit:
/
  @              # literal @
  "              # literal "
  (              # start capture group, we want everything between the quotes
    (?:          # start non-capturing group (a group we can safely repeat)
      [^"\\]     # match any character that's not a " or a \
      |          # ...or...
      \\.        # a literal \ followed by any character
    )*           # close non-capturing group and allow zero or more occurrences
  )              # close the capture group
  "              # literal "
/x
The really important points are these:
- [^"\\]|\\. means that every backslash is "balanced": every backslash must escape a character, and no character will be considered more than once.
- Wrapping the above in a * repeated group means that the pattern can occur an unlimited number of times, and that empty strings are allowed (if you don't want to allow empty strings, change the * to a +). This is the "loop" part of the "unrolled loop".
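Strictly speaking, the alternation shown above is the loop before it has been unrolled. Friedl's unrolled form is usually written as normal* (special normal*)*, which here would be [^"\\]*(?:\\.[^"\\]*)*. The following is a minimal sketch, with made-up sample text, that compares the two forms:

```php
<?php
// Alternation form used in this answer.
$alternation = '/@"((?:[^"\\\\]|\\\\.)*)"/';
// Unrolled form: normal* (special normal*)*
$unrolled    = '/@"([^"\\\\]*(?:\\\\.[^"\\\\]*)*)"/';

// Illustrative input: plain, empty, escaped quotes, and a trailing escaped backslash.
$subject = <<<'TXT'
@"plain" @"" @"with \"quotes\" inside" @"ends in a backslash \\"
TXT;

preg_match_all($alternation, $subject, $a);
preg_match_all($unrolled, $subject, $b);

var_dump($a[1] === $b[1]); // compare the captured contents of both forms
print_r($b[1]);
```

Both forms should capture the same four strings for this input. The unrolled form mainly matters for performance on long inputs, because it avoids trying the alternation once per character.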
But the output string still contains the backslashes that escape the quotes?
Indeed it does: this is just a matching procedure, and it doesn't modify the match. But because the result is the contents of the string, where every quote is necessarily preceded by the backslash that escapes it, a simple str_replace('\\"', '"', $result) will be safe and produce the correct result.
However, when doing this sort of thing, I often find I want to handle other escape sequences as well - in which case I usually do something like this to the result:
// Note the double escaping: '/\\\\./' reaches PCRE as /\\./ (a backslash plus any character)
$result = preg_replace_callback('/\\\\./', function ($match) {
    switch ($match[0][1]) { // inspect the escaped character
        case 'r':
            return "\r";
        case 'n':
            return "\n";
        case 't':
            return "\t";
        case '\\':
            return '\\';
        case '"':
            return '"';
        default: // if it's not a valid escape sequence, treat the \ as literal
            return $match[0];
    }
}, $result);
This gives similar behaviour to a double-quoted string in PHP, where \t is replaced with a tab, \n is replaced with a newline and so on.
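To make the behaviour easier to try out, the callback approach can be wrapped in a small function. This is only a sketch: the name unescape_literal is hypothetical, the lookup table replaces the switch purely for brevity, and the s modifier is an added assumption so that a backslash followed by a newline is also treated as one unit.

```php
<?php
// Hypothetical helper wrapping the callback approach described above.
function unescape_literal(string $raw): string
{
    // '/\\\\./s' reaches PCRE as /\\./s: one literal backslash plus any character.
    return preg_replace_callback('/\\\\./s', function (array $match): string {
        $map = ['r' => "\r", 'n' => "\n", 't' => "\t", '\\' => '\\', '"' => '"'];
        // Unknown escape sequences are returned untouched, backslash included.
        return $map[$match[0][1]] ?? $match[0];
    }, $raw);
}

// Nowdoc input, so every backslash below is a literal backslash character.
$raw = <<<'TXT'
Tab:\t, quote:\", backslash:\\, unknown:\q
TXT;

echo unescape_literal($raw), PHP_EOL;
```

The known sequences are converted, while \q is left exactly as it was.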
What if I want to allow single-quoted strings as well?
This has bugged me for a very long time. I have always had a niggling feeling that this could be more efficiently handled with backreferences but numerous attempts have failed to yield any viable results.
I do this:
/(?:"((?:[^"\\]|\\.)*)")|(?:'((?:[^'\\]|\\.)*)')/
As you can see, this is basically just applying the same pattern twice, with an OR relationship. This complicates the string extraction very slightly on the PHP side as well:
$expr = '/(?:"((?:[^"\\\\]|\\\\.)*)")|(?:\'((?:[^\'\\\\]|\\\\.)*)\')/';
preg_match_all($expr, $subject, $matches);
$result = array();
for ($i = 0; isset($matches[0][$i]); $i++) {
    if ($matches[1][$i] !== '') {
        $result[] = $matches[1][$i];
    } else {
        $result[] = $matches[2][$i];
    }
}
print_r($result);
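One possible way to simplify the extraction, offered as a sketch rather than as part of the original approach, is PCRE's branch reset group (?|...). Inside it, each alternative numbers its capture groups from the same starting point, so both the double-quoted and single-quoted contents land in group 1. The sample text is made up for illustration.

```php
<?php
// Branch reset: both alternatives store their capture as group 1.
$expr = '/(?|"((?:[^"\\\\]|\\\\.)*)"|\'((?:[^\'\\\\]|\\\\.)*)\')/';

// Illustrative input containing both quote styles and an empty string.
$subject = <<<'TXT'
a = "double \"q\"", b = 'single \'q\'', c = ""
TXT;

preg_match_all($expr, $subject, $matches);
print_r($matches[1]); // contents of every literal, regardless of quote style
```

With this variant no loop over two capture groups is needed, and an empty literal is still reported as an empty string.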

-
@Allendar The beauty of this is that it will only break on an unescaped double quote, any other combination of backslash escapes will be left untouched. I will try and break down how it works in a comprehensible way. – DaveRandom Mar 19 '13 at 12:29
-
Thanks, You've made my day, Dave! non-capturing group is exactly what I was looking for and that was missed part of my knowledge about regexp – heximal Mar 19 '13 at 13:57
You need to use a negative lookbehind: match everything until you find a quote not preceded by a backslash. This is in Java:
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class QuotedStrings {
    public static void main(String[] args) {
        final String[] strings = new String[]{"@\"I'm a string\"", "@\"I'm plain string. I'm \\\"qouted\\\" \""};
        // Lazy match up to the first quote that is not preceded by a backslash.
        final Pattern p = Pattern.compile("@\"(.*?)(?<!\\\\)\"");
        System.out.println(p.pattern());
        for (final String string : strings) {
            final Matcher matcher = p.matcher(string);
            while (matcher.find()) {
                System.out.println(matcher.group(1));
            }
        }
    }
}
Output:
@"(.*?)(?<!\\)"
I'm a string
I'm plain string. I'm \"qouted\"
The pattern (without all the Java escapes) is: @"(.*?)(?<!\\)"
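For PHP readers, the idea described in this answer (stop at a quote that is not preceded by a backslash) might be sketched as below, with the lookbehind placed directly before the closing quote. The sample text is made up, and the second pattern is the alternation pattern from the other answer, included only for comparison. One limitation worth knowing: a literal that ends in an escaped backslash, such as @"C:\\", confuses the lookbehind, because the closing quote is preceded by a backslash.

```php
<?php
// Lookbehind variant: lazily match up to the first quote not preceded by a backslash.
$lookbehind  = '/@"(.*?)(?<!\\\\)"/s';
// Alternation variant from the other answer, for comparison.
$alternation = '/@"((?:[^"\\\\]|\\\\.)*)"/s';

// Illustrative input; the second literal ends with an escaped backslash.
$subject = <<<'TXT'
@"say \"hi\"" @"C:\\" @"next"
TXT;

preg_match_all($lookbehind, $subject, $a);
preg_match_all($alternation, $subject, $b);

print_r($a[1]); // the second match runs on past the end of its literal
print_r($b[1]); // three separate literals
```

So the lookbehind form is fine when backslashes only ever escape quotes, but the alternation form is the safer choice if escaped backslashes can appear.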

-
Regex is regex right? The regex pattern works, the OP just needs to escape whatever needs escaping in PHP. – Boris the Spider Mar 19 '13 at 12:24
-
PHP regex actually has different rules, requiring extra escapes. Besides that, it might confuse the person asking the question if he/she lacks knowledge of Java, even though it might make sense to most of us; it's not safe to assume it does. Your regex itself is correct tho :) – Mar 19 '13 at 12:26
-
@Allendar I thought that was to do with the `String` itself rather than the regex - i.e. in order to get a string literal that looks like the pattern you need to do the escapes differently. – Boris the Spider Mar 19 '13 at 12:29
-
You are correct on that matter, but he tagged his question with PHP. It's commonly known Regular Expressions have some weird outcomes in different languages. I just wanted to point that out. No worries tho, I like your answer :) – Mar 19 '13 at 12:32