this is useful, because I can then do for example this:
$xPath->query('//div.class');

So I need a regex that performs these transformations:

Example 1
text().some_class => text()[contains(concat(" ", @class, " "), " some_class ")]
Example 2: nothing to do, because the dot is inside a quoted string
@src = 'obr.gif' => @src = 'obr.gif'
Example 3
*.class => *[contains(concat(" ", @class, " "), " class ")]
Example 4
div.class => div[contains(concat(" ", @class, " "), " class ")]
Example 5: do nothing, because there is no subject that should have this class (I know this is not valid XPath)
div[.neco] => div[.neco]

I used PHP preg_replace this way:

preg_replace(
        '/\.([a-z_][\w-]*)/i',
        '[contains(concat(" ", @class, " "), " $1 ")]',
        $xPath);

That only worked for examples 1, 3 and 4, so I updated it:

preg_replace(
        '/(?<=[\w*\])])\.([a-z_][\w-]*)/i',
        '[contains(concat(" ", @class, " "), " $1 ")]',
        $xPath);

Then only example 2 didn't work. I tried this:

preg_replace(
        '/(\'[^\']+\'.*?)*(?<=[\w*\])])\.([a-z_][\w-]*)/i',
        '$1[contains(concat(" ", @class, " "), " $2 ")]',
        $xPath);

That works for:
//div[@src = 'obr.gif'].class => //div[@src = 'obr.gif'][contains(concat(" ", @class, " "), " class ")]
But for example 2 it gets it wrong:
@src = 'obr.gif' => @src = 'obr[contains(concat(" ", @class, " "), " gif ")]'
I realize that PHP tries hard to match at least something, so it "ignores" the first parenthesized group (which is allowed to match zero times), but I don't know how to write a regex that behaves the way I want.
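
The failure on example 2 can be reproduced outside PHP. The sketch below is JavaScript rather than PHP and assumes an engine with lookbehind support (for instance a recent Node.js); the variable names are illustrative only. It uses the same pattern as the third attempt and shows that the quoted-string group, being wrapped in `(...)*`, takes part zero times, so the match simply starts at the dot inside the quotes.

```javascript
// Same pattern as the third preg_replace attempt, written as a JS regex literal.
const thirdAttempt = /('[^']+'.*?)*(?<=[\w*\])])\.([a-z_][\w-]*)/i;
const example2 = "@src = 'obr.gif'";
const match = thirdAttempt.exec(example2);

// The optional group matched zero times; the match begins at the dot inside the quotes.
console.log(match.index); // 11
console.log(match[1]);    // undefined
console.log(match[2]);    // gif

// An unmatched group expands to an empty string in the replacement, as in PHP.
const rewritten = example2.replace(
  thirdAttempt,
  '$1[contains(concat(" ", @class, " "), " $2 ")]'
);
console.log(rewritten);
```

Making the group optional therefore cannot protect quoted text: the engine is free to skip it. The quoted string has to be consumed as its own alternative instead.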

PS: I only use single quotes in my XPath expressions, so I don't need to handle double quotes.

EDIT: funkwurm's answer, modified for PHP:

preg_replace_callback(<<<'CLASS'
        /('|").*?(?<!\\)\1|(?<=[\w*\])])\.([a-z_][\w-]*)/i
CLASS
        , function($matches) {
            return $matches[1] ? $matches[0] : "[contains(concat(\" \", @class, \" \"), \" $matches[2] \")]";
        },
        $xPath
);

I'm using nowdoc syntax for the regex so that I don't have to deal with escaping inside a quoted string.

Velda

1 Answer

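The best approach here is the "match this unless condition A|B" technique (further explained here, with an example here): match what you want to skip first, then what you actually want, and check in code which alternative matched.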

I would make the regex like so:

('|")(?:(?!\\|\1).|\\.)*\1|([\w*\])])\.([a-z_][\w-]*)

In your programming language you then check whether the 3rd capture group (the class name) has any content. If so, it is a class and you do your existing substitution. Otherwise you don't want to change anything, which means replacing the match with itself. A JavaScript implementation is below. Note that the callback receives the match m, the capture group of the quote q, the last character before the . in e, and the capture group of the class c. If c is undefined I return the entire match m. Otherwise I do the substitution.

var xpaths = [
  'text().some_class',            // => text()[contains(concat(" ", @class, " "), " some_class ")]
  '@src = \'obr.gif\'',           // => @src = 'obr.gif'
  '*.class',                      // => *[contains(concat(" ", @class, " "), " class ")]
  'div.class',                    // => div[contains(concat(" ", @class, " "), " class ")]
  'div[.neco]',                   // => div[.neco]
  'div[@src = \'obr.gif\'].class',// => div[@src = 'obr.gif'][contains(concat(" ", @class, " "), " class ")]
  'div[.//img.class]'             // => div[.//img[contains(concat(" ", @class, " "), " class ")]]
];

document.getElementById('out').value=xpaths.map(function(str) {
  return str.replace(/('|")(?:(?!\\|\1).|\\.)*\1|([\w*\])])\.([a-z_][\w-]*)/ig, function(m, q, e, c) {
    return (c==undefined)?m:(e+'[contains(concat(" ", @class, " "), " ' + c + ' ")]');
  });
}).join('\n');
<textarea id="out" rows="10" style="width:100%"></textarea>
asontu
  • Thank you for fixing the class-matching rule (underscore). :-) But I don't quite understand the `\[[^\]]*\]` alternative, because it causes a failure with the XPath expression `div[.//img.class]`, where the class of the image isn't handled. – Velda Feb 06 '15 at 14:52
  • Thank you, I was able to modify your regex to work with the previous case and to simplify it slightly by using a look-behind. I removed the `\[[^\]]*\]` part and added back `(?<=[\w*\])])` instead. That resolves the previous case. I simplified the part `(['"])(?:(?!\\|\1).|\\.)*\1` into `('|").*?(?<!\\)\1`, which is shorter and thus easier to understand (I hope it's nearly equivalent). Your solution deals with double quotes and escaped quotes, good. :-) – Velda Feb 06 '15 at 15:30
  • Ah, I thought that anything between `[]` could be assumed not to be a class. But checking whether it's preceded by `[\w*\])]` works too, so I changed my answer to make it work in JavaScript without lookbehind. If my answer helped you, could you click "accept"? It helps us both and the site :) – asontu Feb 06 '15 at 16:22
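
On the "nearly equivalent" question raised in the comments: the two quoted-string patterns agree on ordinary input, but they can differ when a literal ends in an escaped backslash. The sketch below is only an illustration, assuming a JavaScript engine with lookbehind support; the variable names are made up, and such input may never occur in real XPath expressions.

```javascript
// Pattern from the answer: consumes backslash escapes as pairs.
const strict = /(['"])(?:(?!\\|\1).|\\.)*\1/;
// Simplified pattern from the comment: only checks the character before the closing quote.
const simplified = /('|").*?(?<!\\)\1/;

// The characters of this input are: 'a\\' and 'b'
// (the first literal ends in an escaped backslash).
const text = "'a\\\\' and 'b'";

console.log(strict.exec(text)[0]);     // 'a\\'
console.log(simplified.exec(text)[0]); // 'a\\' and '

// On a plain literal both patterns match the same text.
const plain = "@src = 'obr.gif'";
console.log(strict.exec(plain)[0] === simplified.exec(plain)[0]); // true
```

The simplified pattern treats the closing quote as escaped because a backslash precedes it, so it runs on to the next quote. If backslash escapes never appear in the expressions being rewritten, the shorter form behaves the same.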