1

I'm a bit ashamed to admit it, but some aspects of regular expressions are still unclear to me. I need to parse a text file that contains a number of string literals in the format @"I'm a string". I've composed a simple pattern, /@"([^"]*)"/si. It works perfectly: preg_match_all returns a collection. But obviously it doesn't work properly if a string literal contains escaped quotes, like @"I'm plain string. I'm \"qouted\" string ". I would appreciate any clue.

heximal

2 Answers

2

This is a use case for Friedl's classic "unrolled loop" (EDIT: fixed grouping for capture):

/"((?:[^"\\]|\\.)*)"/

This will match the quoted string, taking backslash-escaped quotes into account.

The full regex you would use to match a field (including the @) would be:

/@"((?:[^"\\]|\\.)*)"/

But be careful! I often see people complaining that this pattern doesn't work in PHP, and this is because of the slightly mind-melting nature of using backslashes in a PHP string.

The backslashes in the above pattern represent a literal backslash that needs to be passed into PCRE. This means that they need to be double-escaped when using them in a PHP string:

$expr = '/@"((?:[^"\\\\]|\\\\.)*)"/';

preg_match_all($expr, $subject, $matches);

print_r($matches[1]); // this will show the content of all the matched fields
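
To make the double escaping concrete, here is a minimal self-contained sketch. The sample $subject is an assumption built from the two strings in the question; the nowdoc keeps its backslashes exactly as typed. Echoing $expr shows the pattern as PCRE actually receives it, with two backslashes where the PHP source has four.

```php
<?php
// Sample input (an assumption, taken from the question's examples).
// A nowdoc performs no escape processing, so every backslash stays literal.
$subject = <<<'TXT'
@"I'm a string"
@"I'm plain string. I'm \"qouted\" string "
TXT;

// Four backslashes in a single-quoted PHP string become two in the pattern,
// which PCRE then reads as one literal backslash.
$expr = '/@"((?:[^"\\\\]|\\\\.)*)"/';

echo $expr, "\n";                           // the pattern as PCRE receives it
preg_match_all($expr, $subject, $matches);
print_r($matches[1]);                       // contents of both literals, escapes intact
```

With only two backslashes in the PHP source, PCRE would instead receive \. (a literal dot), which is usually where the "it doesn't work in PHP" reports come from.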


How does it work?

...I hear you ask. Well, let's see if I can explain this in a way that actually makes sense. Let's enable x mode so we can space it out a bit:

/
  @             # literal @
  "             # literal "
    (           # start capture group, we want everything between the quotes
      (?:       # start non-capturing group (a group we can safely repeat)
        [^"\\]  # match any character that's not a " or a \
        |       # ...or...
        \\.     # a literal \ followed by any character
      )*        # close non-capturing group and allow zero or more occurrences
    )           # close the capture group
  "             # literal "
/x

The really important points are these:

  • [^"\\]|\\. - means that every backslash is "balanced" - every backslash must escape a character, and no character will be considered more than once.
  • Wrapping the above in a * repeated group means that the above pattern can occur an unlimited number of times, and that empty strings are allowed (if you don't want to allow empty strings, change the * to a +). This is the "loop" part of the "unrolled loop".
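
For comparison, the form that Friedl's book usually presents as "unrolled" drops the alternation and is written as normal* (special normal*)*. As far as I can tell, it accepts the same strings as the alternation pattern above and is generally reported to backtrack less. The sketch below runs both on a made-up sample (the $subject and variable names are assumptions for illustration) and checks that they capture the same contents.

```php
<?php
// Made-up sample: a plain literal, an empty one, and one ending in an escaped backslash.
$subject = <<<'TXT'
@"I'm a string" @"" @"I'm \"qouted\" and end with a backslash \\"
TXT;

// Alternation form: (normal | special)*
$alternation = '/@"((?:[^"\\\\]|\\\\.)*)"/';
// Unrolled form: normal* (special normal*)*
$unrolled = '/@"([^"\\\\]*(?:\\\\.[^"\\\\]*)*)"/';

preg_match_all($alternation, $subject, $a);
preg_match_all($unrolled, $subject, $b);

var_dump($a[1] === $b[1]); // both forms capture the same contents on this input
print_r($b[1]);
```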

But the output string still contains the backslashes that escape the quotes?

Indeed it does. This is just a matching procedure; it doesn't modify the match. But because the result is the contents of the string, a simple str_replace('\\"', '"', $result) will be safe and produce the correct result.

However, when doing this sort of thing, I often find I want to handle other escape sequences as well - in which case I usually do something like this to the result:

 $result = preg_replace_callback('/\\\\./', function($match) { // four backslashes here, so that PCRE receives \\.
     switch ($match[0][1]) { // inspect the escaped character
         case 'r':
             return "\r";

         case 'n':
             return "\n";

         case 't':
             return "\t";

         case '\\':
             return '\\';

         case '"':
             return '"';

         default: // if it's not a valid escape sequence, treat the \ as literal
             return $match[0];
     }
 }, $result);

This gives similar behaviour to a double-quoted string in PHP, where \t is replaced with a tab, \n is replaced with a newline and so on.
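
As a self-contained sketch of this unescaping step, the same idea can be wrapped in a small helper. The name unescape_literal() and the lookup table are assumptions for illustration, and the code assumes PHP 7 or later for the type declarations and the ?? operator. Note that the pattern again needs four backslashes in the PHP source.

```php
<?php
// Hypothetical helper; the escape table mirrors the cases of the switch above.
function unescape_literal(string $raw): string
{
    $map = ['r' => "\r", 'n' => "\n", 't' => "\t", '\\' => '\\', '"' => '"'];

    // '/\\\\(.)/s' reaches PCRE as /\\(.)/s: one literal backslash, then any character.
    return preg_replace_callback('/\\\\(.)/s', function (array $m) use ($map) {
        // Unknown escapes are returned unchanged, backslash included.
        return $map[$m[1]] ?? $m[0];
    }, $raw);
}

// Raw contents as they would come out of the matching step.
$raw = 'Tab:\\tQuote:\\"Backslash:\\\\Unknown:\\q';
echo unescape_literal($raw), "\n";
```

Because each backslash is consumed together with the character it escapes, an escaped backslash is never mistaken for the start of the next escape sequence.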

What if I want to allow single-quoted strings as well?

This has bugged me for a very long time. I have always had a niggling feeling that this could be more efficiently handled with backreferences but numerous attempts have failed to yield any viable results.

I do this:

/(?:"((?:[^"\\]|\\.)*)")|(?:'((?:[^'\\]|\\.)*)')/

As you can see, this is basically just applying the same pattern twice, with an OR relationship. This complicates the string extraction very slightly on the PHP side as well:

$expr = '/(?:"((?:[^"\\\\]|\\\\.)*)")|(?:\'((?:[^\'\\\\]|\\\\.)*)\')/';

preg_match_all($expr, $subject, $matches);

$result = array();
for ($i = 0; isset($matches[0][$i]); $i++) {
    if ($matches[1][$i] !== '') {
        $result[] = $matches[1][$i];
    } else {
        $result[] = $matches[2][$i];
    }
}

print_r($result);
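
One commonly seen way to bring a backreference into this is to capture the opening quote and use a negative lookahead, so that only an unescaped copy of that same quote can end the string. This is a sketch under that assumption rather than a claim about efficiency; the sample $subject is made up for illustration.

```php
<?php
// Made-up sample mixing both quote styles, including an empty literal.
$subject = <<<'TXT'
"say \"hi\"" 'it\'s' "mixed 'inner' quotes" ''
TXT;

// Group 1 remembers the opening quote. (?!\1) stops an unescaped copy of it
// from being consumed as content, and the final \1 requires the same quote to close.
// As PCRE sees it: /(["'])((?:\\.|(?!\1)[^\\])*)\1/s
$expr = '/(["\'])((?:\\\\.|(?!\\1)[^\\\\])*)\\1/s';

preg_match_all($expr, $subject, $matches);
print_r($matches[2]); // contents are always in group 2, whichever quote was used
```

With this shape the contents always land in the same group, so the loop that picks between two groups is not needed.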
DaveRandom
  • Where `\\.` would include any weird slash-escapes like `\t`, `\n` etc. +1 –  Mar 19 '13 at 12:23
  • @Allendar The beauty of this is that it will only break on an unescaped double quote, any other combination of backslash escapes will be left untouched. I will try and break down how it works in a comprehensible way. – DaveRandom Mar 19 '13 at 12:29
  • Thanks, you've made my day, Dave! The non-capturing group is exactly what I was looking for; that was the missing part of my knowledge about regexp – heximal Mar 19 '13 at 13:57
0

You can use a negative lookbehind: match lazily until you find a quote that is not preceded by a backslash. This is in Java:

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class QuotedLiterals {
    public static void main(String[] args) {
        final String[] strings = new String[]{"@\"I'm a string\"", "@\"I'm plain string. I'm \\\"qouted\\\" \""};

        // The lookbehind sits directly before the closing quote,
        // so a quote preceded by a backslash cannot end the match.
        final Pattern p = Pattern.compile("@\"(.*?)(?<!\\\\)\"");
        System.out.println(p.pattern());

        for (final String string : strings) {
            final Matcher matcher = p.matcher(string);
            while (matcher.find()) {
                System.out.println(matcher.group(1));
            }
        }
    }
}

Output:

@"(.*?)(?<!\\)"
I'm a string
I'm plain string. I'm \"qouted\" 

The pattern (without all the Java escapes) is: @"(.*?)(?<!\\)"
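
Since the question is tagged PHP, here is a rough PHP sketch of the lookbehind idea, written with a lazy .*? and the lookbehind placed immediately before the closing quote. It also illustrates one limitation: a literal that ends in an escaped backslash (\\") is not recognised as closed, whereas the alternation pattern from the other answer handles it. The sample $subject is made up for illustration.

```php
<?php
// Lookbehind variant: lazy match up to the first quote not preceded by a backslash.
$lookbehind = '/@"(.*?)(?<!\\\\)"/s';
// Alternation pattern from the other answer, for comparison.
$alternation = '/@"((?:[^"\\\\]|\\\\.)*)"/';

// Made-up sample; the second literal ends with an escaped backslash.
$subject = <<<'TXT'
@"I'm \"qouted\" " @"ends with a backslash \\" @"next"
TXT;

preg_match_all($lookbehind, $subject, $a);
preg_match_all($alternation, $subject, $b);

print_r($a[1]); // the second match runs on into the start of the third literal
print_r($b[1]); // three separate literals
```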

Boris the Spider
  • Wasn't the question asked for PHP? –  Mar 19 '13 at 12:23
  • Regex is regex right? The regex pattern works, the OP just needs to escape whatever needs escaping in PHP. – Boris the Spider Mar 19 '13 at 12:24
  • 1
    PHP regex actually has different rules, requiring extra escapes. Besides that, it might confuse the person asking the question if he/she lacks knowledge of Java, even tho it might make sense to most of us; it's not safe to assume that it does. Your regex itself is correct tho :) –  Mar 19 '13 at 12:26
  • @Allendar I thought that was to do with the `String` itself rather than the regex - i.e. in order to get a string literal that looks like the pattern you need to do the escapes differently. – Boris the Spider Mar 19 '13 at 12:29
  • You are correct on that matter, but he tagged his question with PHP. It's commonly known Regular Expressions have some weird outcomes in different languages. I just wanted to point that out. No worries tho, I like your answer :) –  Mar 19 '13 at 12:32