10

Here's my string:

address='St Marks Church',notes='The North East\'s premier...'

The regex I'm using to grab the various parts using match_all is

'/(address|notes)='(.+?)'/i'

The results are:

address => St Marks Church
notes => The North East\

How can I get it to ignore the \' character for the notes?

mickmackusa
Paul Phillips
  • Would you want to only consider alphanumeric characters in your expression? –  Jun 06 '13 at 19:48
  • No basically anything between ' and the second ' excluding \'. I'm a bit of a regex newbie I'm afraid so probably got the first bit wrong too? – Paul Phillips Jun 06 '13 at 19:51

3 Answers

5

Not sure if you're wrapping your string with heredoc or double quotes, but a less greedy approach:

$str4 = 'address="St Marks Church",notes="The North East\'s premier..."';
preg_match_all('~(address|notes)="([^"]*)"~i',$str4,$matches);
print_r($matches);

Output

Array
(
    [0] => Array
        (
            [0] => address="St Marks Church"
            [1] => notes="The North East's premier..."
        )

    [1] => Array
        (
            [0] => address
            [1] => notes
        )

    [2] => Array
        (
            [0] => St Marks Church
            [1] => The North East's premier...
        )

)

Another method with preg_split:

$string = "address='St Marks Church',notes='The North East\'s premier...'";
//split the string at the comma
//assumes no commas in text
$parts = preg_split('!,!', $string);
foreach($parts as $key=>$value){
    //split the values at the = sign
    $parts[$key]=preg_split('!=!',$value);
    foreach($parts[$key] as $k2=>$v2){
        //trim the quotes out and remove the slashes
        $parts[$key][$k2]=stripslashes(trim($v2,"'"));
    }
}

Output looks like:

Array
(
    [0] => Array
        (
            [0] => address
            [1] => St Marks Church
        )

    [1] => Array
        (
            [0] => notes
            [1] => The North East's premier...
        )

)

Super slow old-skool method:

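$string = "address='St Marks Church',notes='The North East\'s premier...'";
$len = strlen($string);
$key = "";
$value = "";
$store = array();
$pos = 0;
$mode = 'key';
while($pos < $len){
  switch($string[$pos]){
    case '=':
        //an equals sign ends the key and starts the value
        $mode = 'value';
        break;
    case ',':
        //a comma ends the pair; store it and reset
        $store[$key]=trim($value,"'");
        $key=$value='';
        $mode = 'key';
        break;
    default:
        //append the character to $key or $value (variable variable)
        $$mode .= $string[$pos];
  }

  $pos++;
}
//store the final pair, which has no trailing comma
$store[$key]=trim($value,"'");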
AbsoluteƵERØ
  • Your first method adjusts the input string to suit the method, this method should be removed. The second uses `preg_split ()` where `explode()` is the sensible function call. Furthermore, if `\'` is possible in the string, then it is fair to assume `,` and `=` are possible as well. The third one, I didn't test yet but it either has a typo or is employing variable variables which should be avoided whenever possible. – mickmackusa Nov 21 '17 at 08:43
  • I removed my downvote because I appreciate that you are trying to fix your answer. Sadly, I feel I had to re-downvote because this answer is suggesting poor and/or unreliable methods. – mickmackusa Nov 21 '17 at 09:39
  • Making concessions for bad data storage methods is never advisable. This text stream should be stored in JSON, XML, or even CSV and processed with industry standard methods ideally. Appreciate your opinion though. – AbsoluteƵERØ Nov 21 '17 at 17:31
2

Because you have posted that you are using match_all and the top tags in your profile are php and wordpress, I think it is fair to assume you are using preg_match_all() with php.

The following patterns will match the substrings required to build your desired associative array:

Patterns that generate a fullstring match and 1 capture group:

  1. /(address|notes)='\K(?:\\\'|[^'])*/ (166 steps, demo link)
  2. /(address|notes)='\K.*?(?=(?<!\\)')/ (218 steps, demo link)

Patterns that generate 2 capture groups:

  3. /(address|notes)='((?:\\\'|[^'])*)/ (168 steps, demo link)
  4. /(address|notes)='(.*?(?<!\\))'/ (209 steps, demo link)

Code: (Demo)

$string = "address='St Marks Church',notes='The North East\'s premier...'";

preg_match_all(
    "/(address|notes)='\K(?:\\\'|[^'])*/",
    $string,
    $out
);
var_export(array_combine($out[1], $out[0]));

echo "\n---\n";

preg_match_all(
    "/(address|notes)='((?:\\\'|[^'])*)/",
    $string,
    $out,
    PREG_SET_ORDER
);
var_export(array_column($out, 2, 1));

Output:

array (
  'address' => 'St Marks Church',
  'notes' => 'The North East\\\'s premier...',
)
---
array (
  'address' => 'St Marks Church',
  'notes' => 'The North East\\\'s premier...',
)

Patterns #1 and #3 use alternatives to allow either an escaped apostrophe (one preceded by a backslash) or any non-apostrophe character.

Patterns #2 and #4 (these require an additional backslash when implemented in PHP, see demo) use lookarounds to ensure that apostrophes preceded by a backslash don't end the match.
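As a minimal sketch of the backslash point above, here is the lookbehind pattern with two capture groups written as a single-quoted PHP string. The variable names `$pattern` and `$result` are illustrative only, and the lookbehind is written with four backslashes in the PHP literal so that the regex engine receives `(?<!\\)`.

```php
<?php
// Sample input from the question; inside double quotes, \' stays as backslash + apostrophe.
$string = "address='St Marks Church',notes='The North East\'s premier...'";

// Lookbehind pattern as a single-quoted PHP literal:
//   \'    becomes '   (the literal apostrophes in the regex)
//   \\\\  becomes \\  (the regex token for one literal backslash)
$pattern = '/(address|notes)=\'(.*?(?<!\\\\))\'/';

preg_match_all($pattern, $string, $matches, PREG_SET_ORDER);

// Use capture group 1 as the key and capture group 2 as the value.
$result = array_column($matches, 2, 1);

var_export($result);
```

The escaped apostrophe no longer ends the match, and each value keeps its backslash exactly as it appears in the input.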

Some notes:

  • Using capture groups, alternatives, and lookarounds often costs pattern efficiency. Limiting the use of these components often improves performance. Using negated character classes with greedy quantifiers often improves performance.

  • Using \K (which restarts the fullstring match) is useful when trying to reduce capture groups and it reduces the size of the output array.
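To illustrate the note on \K, the following sketch runs the two alternation patterns from this answer side by side and compares the size of the resulting arrays. The variable names `$withK` and `$withGroup` are made up for this example.

```php
<?php
$string = "address='St Marks Church',notes='The North East\'s premier...'";

// With \K: everything matched before \K is dropped from the full match,
// so the output holds the full matches [0] and the key group [1] only.
preg_match_all("/(address|notes)='\K(?:\\\'|[^'])*/", $string, $withK);

// Without \K: a second capture group is needed to isolate the value,
// so the output holds the full matches [0] plus groups [1] and [2].
preg_match_all("/(address|notes)='((?:\\\'|[^'])*)/", $string, $withGroup);

$countWithK = count($withK);         // number of sub-arrays with \K
$countWithGroup = count($withGroup); // number of sub-arrays with two groups

echo $countWithK, " vs ", $countWithGroup, "\n";
```

The values found in `$withK[0]` are the same strings that the second pattern places in `$withGroup[2]`.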

mickmackusa
  • @PaulPhillips over 4 years later, you may no longer be a newbie at regex. Please review all of the answers on this page. Sadly the other answers on this page are inaccurate/incorrect and have gathered upvotes over time (which means they have been misinforming readers for years). If you have any questions about my answer or why the other answers are not correct, I will be happy to explain. – mickmackusa Nov 19 '17 at 07:01
  • Hey Mick you trolling everybody's past answers or just mine? – AbsoluteƵERØ Nov 20 '17 at 21:07
  • I happened upon this page while researching for another question on another StackExchange site. There is nothing trollish about my conduct. If I wanted to be a troll, I would call you names or more simply not leave a comment. No, what I have done is identified a page that contained 3 incorrect answers (now 2 after anubhava deleted his), justifiably downvoted bad answers that misinform, left explanatory comments (with demo links), edited the question, and provided a comprehensive and thoughtful answer. What I have done should only be considered "content improvement". – mickmackusa Nov 20 '17 at 22:23
  • I'm guessing it used to work (though I'm not sure how) otherwise people just glanced and thought it worked, though it was marked as the answer, so it likely helped the OP to figure out their issue. Whatever. – AbsoluteƵERØ Nov 21 '17 at 01:14
  • It never worked as intended. The OP blindly trusted the answers. The snowball grew as the blind trusted the blind for years. – mickmackusa Nov 21 '17 at 01:19
1

You should match up to an end quote that isn't preceded by a backslash thus:

(address|notes)='(.*?)[^\\]'

This [^\\] forces the character immediately preceding the ' character to be anything but a backslash.

Kimball Robinson
  • Will that work if input is: `"address='.',notes='The North East\'s premier...'"` ? – anubhava Jun 06 '13 at 19:59
  • As @anubhava alluded to, this answer is incorrect and will mangle the expected return values. https://regex101.com/r/90fBSr/1 (downvoted as misleading) – mickmackusa Nov 18 '17 at 12:21
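The mangling described in the comments above can be reproduced with a short sketch, assuming the pattern from this answer is run through preg_match_all() on the question's sample string. Because `[^\\]` consumes one character outside the capture group, that character goes missing from each captured value. The variable names are illustrative.

```php
<?php
$string = "address='St Marks Church',notes='The North East\'s premier...'";

// The pattern from this answer as a single-quoted PHP literal;
// [^\\\\] in the literal becomes [^\\] in the regex.
$pattern = '~(address|notes)=\'(.*?)[^\\\\]\'~';

preg_match_all($pattern, $string, $matches, PREG_SET_ORDER);

// Use capture group 1 as the key and capture group 2 as the value.
$result = array_column($matches, 2, 1);

var_export($result);
```

Each captured value comes back without its final character ("St Marks Churc" instead of "St Marks Church"), which is why the alternation or lookbehind patterns are the safer choice.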