
The regex for matching a quoted string has been solved many times over. One good answer is here: https://stackoverflow.com/a/5696141/692331

$re_dq = '/"[^"\\\\]*(?:\\\\.[^"\\\\]*)*"/s';

This seems to be the standard solution for PHP.

My issue is that my quotes are escaped by another quote, not by a backslash. Example:

="123 4556 789 ""Product B v.24"""

="00 00 F0 FF ""Licence key for blah blah"" hfd.34"

=""

The pattern should match the following from the strings above, respectively:

string '123 4556 789 ""Product B v.24""' (length=31) 

string '00 00 F0 FF ""Licence key for blah blah"" hfd.34' (length=48) 

string '' (length=0) 

The examples given are just illustrations of what the string may look like and are not the actual strings I will be matching, which can number in the tens of thousands.

I need a regex pattern that will match a double quoted string which may OR MAY NOT contain sequences of two double quotes.

UPDATE 5/5/14:

See Answer Below

Kenneth
  • Do you want `Licence key for blah blah` as a separate matched group? – anubhava Apr 28 '14 at 21:41
  • No, each line should be a single group – Kenneth Apr 28 '14 at 21:52
  • Can you not replace `'""'` by `''` and then just grab all quoted strings? – anubhava Apr 28 '14 at 22:01
  • No, I need the quotes to remain so that I can properly escape them later – Kenneth Apr 28 '14 at 22:26
  • You don't need to escape the double quotes if you use them inside single quotes, e.g. `preg_match('/"123 4556 789 ""Product B v\.24"""/', $subject)` – Pedro Lobito Apr 28 '14 at 22:52
  • Why not just call: `str_getcsv($subject)[0];` – anubhava Apr 28 '14 at 23:01
  • @Tuga, I have no idea what will be in the quoted string. Not to mention there are thousands of them. I am working with a proprietary database export and processing it with a custom [Lexer](http://en.wikipedia.org/wiki/Lexical_analysis) in order to convert it to XML. – Kenneth Apr 29 '14 at 13:34
  • @anubhava, I don't think you understand the problem statement.. – Kenneth Apr 29 '14 at 13:37
  • Onus is always on OP to explain a problem clearly. After 16 hrs of posting a question and so many comments problem remains unclear with only 1 answer, something is amiss here. – anubhava Apr 29 '14 at 13:44
  • @anubhava you aren't missing anything; the OP is: a clear question. – Pedro Lobito Apr 29 '14 at 13:57

2 Answers


Edit: Per your request, minor mod to account for empty quotes.

(?<!")"(?:[^"]|"")*"

Original solution:

(?<!")"(?:[^"]|"")+"

Demo:

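<?php
// Two of the sample strings, one per line.
$string = '
"123 4556 789 ""Product B v.24"""
"00 00 F0 FF ""Licence key for blah blah"" hfd.34"';
// An opening quote not preceded by a quote, then any run of non-quote
// characters or doubled quotes, then the closing quote.
$regex='~(?<!")"(?:[^"]|"")+"~';
// preg_match_all returns the number of full matches; $m[0] holds them.
$count = preg_match_all($regex,$string,$m);
echo $count."<br /><pre>";
print_r($m[0]);
echo "</pre>";
?>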

Output:

2

Array
(
    [0] => "123 4556 789 ""Product B v.24"""
    [1] => "00 00 F0 FF ""Licence key for blah blah"" hfd.34"
)
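
The demo above still uses the original `+` pattern. As a minimal sketch of the edited `*` variant, the following runs it over the three sample lines from the question, including the empty string. The variable names `$lines` and `$matches` are only illustrative.

```php
<?php
// The three sample lines from the question, including the empty string.
$lines = [
    '="123 4556 789 ""Product B v.24"""',
    '="00 00 F0 FF ""Licence key for blah blah"" hfd.34"',
    '=""',
];

// The edited pattern: * instead of + so that zero characters are allowed.
$regex = '~(?<!")"(?:[^"]|"")*"~';

$matches = [];
foreach ($lines as $line) {
    // preg_match returns 1 on a match, 0 on no match, false on error.
    if (preg_match($regex, $line, $m) === 1) {
        $matches[] = $m[0];
    }
}

print_r($matches);
```

With `+`, the last line would not match, because at least one character or doubled quote is required between the outer quotes.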
zx81
  • @Kenneth FYI Added a demo. Let me know if this is what you are trying to match, and if not what tweaks are needed. – zx81 Apr 28 '14 at 21:51
  • Appears to work as expected on regex101.com. I will try it in my application in the morning and accept the answer if no issues arise. Thanks! – Kenneth Apr 28 '14 at 22:14
  • Your regex matches everything except an empty string "" (which I did not specify, but I have now modified the question). One minor mod accounts for that: `(?<!")"(?:[^"]|"")*"`. If you will modify the answer I will accept it. Thanks! – Kenneth Apr 29 '14 at 16:52
  • @Kenneth Per your request, I added your requested mod to the solution. :) – zx81 Apr 29 '14 at 19:09

I found that the pattern from zx81, adapted to anchor on the leading equals sign,

$re_dq_answer = '/="(?:[^"]|"")*"/';

results in backtracking after every single matched character. I found that I could instead adapt the pattern at the very top of my question to suit my needs.

$re_dq_original = '/="[^"\\\\]*(?:\\\\.[^"\\\\]*)*"/s';

becomes

$re_dq_modified = '/="([^"]*(?:""[^"]*)*)"/';
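
As a quick check, here is a small sketch that runs the modified pattern over the three sample lines from the question and compares the length of capture group 1 with the lengths listed there (31, 48 and 0). The `$samples` and `$lengths` names are only illustrative.

```php
<?php
$re_dq_modified = '/="([^"]*(?:""[^"]*)*)"/';

// Sample line => expected length of the captured content (from the question).
$samples = [
    '="123 4556 789 ""Product B v.24"""' => 31,
    '="00 00 F0 FF ""Licence key for blah blah"" hfd.34"' => 48,
    '=""' => 0,
];

$lengths = [];
foreach ($samples as $line => $expected) {
    // Group 1 holds everything between the outer quotes, doubled quotes intact.
    preg_match($re_dq_modified, $line, $m);
    $lengths[] = strlen($m[1]);
    printf("%-55s length=%d (expected %d)\n", $line, strlen($m[1]), $expected);
}
```

The pattern consumes a whole run of non-quote characters in one go and only enters the group when it actually meets a doubled quote, which is why it takes far fewer steps than a per-character alternation.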

The 's' pattern modifier isn't necessary because the modified pattern does not use the dot (.) metacharacter, which is the only thing that modifier affects.
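
To illustrate what the 's' modifier changes, here is a minimal sketch based on PHP's documented PCRE_DOTALL behaviour: the modifier only decides whether a dot matches a newline, while a negated character class such as [^"] matches newlines either way. The test strings are made up for illustration.

```php
<?php
// Without s, the dot does not match a newline; with s it does.
$dot_plain  = preg_match('/a.b/',  "a\nb");
$dot_dotall = preg_match('/a.b/s', "a\nb");

// A negated character class matches a newline with or without s,
// so the modified pattern handles multi-line values unchanged.
$multiline = "=\"line one\nline two\"";
$class_plain  = preg_match('/="([^"]*(?:""[^"]*)*)"/',  $multiline, $m1);
$class_dotall = preg_match('/="([^"]*(?:""[^"]*)*)"/s', $multiline, $m2);

var_dump($dot_plain, $dot_dotall, $class_plain, $class_dotall, $m1[1] === $m2[1]);
```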

The longest string I have had to match was 28,000 characters, which caused Apache to crash with a stack overflow. I had to increase the stack size to 32 MB (the Linux default is 8 MB, Windows is 1 MB) just to get by! I didn't want every thread to have such a large stack, so I started looking for a better solution.

Example (tested on Regex101): a string of 3,200 characters that required 6,637 steps to match using $re_dq_answer now requires 141 steps using $re_dq_modified. A slight improvement, I'd say!
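
For anyone wanting to try this on their own setup, here is a rough sketch that feeds the modified pattern a 28,000-character value and checks preg_last_error() afterwards. The test data is made up (runs of "x" with a doubled quote every 100 characters), and step counts like the Regex101 figures above cannot be read from PHP, so this only confirms that the match completes.

```php
<?php
$re_dq_modified = '/="([^"]*(?:""[^"]*)*)"/';

// Hypothetical test data: 28,000 characters of content, built from
// 98 filler characters followed by one doubled quote, repeated.
$body = '';
while (strlen($body) < 28000) {
    $body .= str_repeat('x', 98) . '""';
}
$subject = '="' . $body . '"';

$result = preg_match($re_dq_modified, $subject, $m);

// preg_match returns false on an engine error; preg_last_error() says which.
if ($result === false || preg_last_error() !== PREG_NO_ERROR) {
    echo "PCRE error code: " . preg_last_error() . "\n";
} else {
    echo "matched " . strlen($m[1]) . " characters\n";
}
```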

Kenneth