I am handling some XPath strings that I want to modify before actually using them. For those who are not familiar with it: XPath is, in short, a way of describing a pattern in an XML structure, and it is often used as formal input for an XPath/XQuery-based search engine.

The goal

To see an expanded/prettified version of the XPath snippets below, you can use the following beautifier. Disclaimer: I am the author of that tool.

My XPath strings can be quite simple

//node[@cat="smain" and node[@rel="su" and @pt="vnw"] and node[@rel="hd" and @pt="ww"] and node[@rel="predc" and @cat="np" and node[@rel="det" and @pt="lid"] and node[@rel="hd" and @pt="n"]]] 

but also very elaborate

//node[@cat="top" and node[@rel="--" and @cat="smain" and node[@rel="su" and @pt="vnw" and @word="Dit" and @lemma="dit" and number(@begin) < ../node[@rel="hd" and @pt="ww" and @lemma="zijn"]/number(@begin)] and node[@rel="hd" and @pt="ww" and @lemma="zijn" and number(@begin) < ../node[@rel="predc" and @cat="np"]/node[@rel="det" and @pt="lid" and @word="een" and @cs="no" and @lemma="een"]/number(@begin)] and node[@rel="predc" and @cat="np" and node[@rel="det" and @pt="lid" and @word="een" and @cs="no" and @lemma="een" and number(@begin) < ../node[@rel="hd" and @pt="n" and @cs="no" and @lemma="zin"]/number(@begin)] and node[@rel="hd" and @pt="n" and @cs="no" and @lemma="zin" and number(@begin) < ../../../node[@rel="--" and @pt="let"]/number(@begin)]]] and node[@rel="--" and @pt="let"]]

As you may have noticed, node is the only element name that is used; only the attributes differ. The attribute that I am interested in is @cs="no", which means that a future search request should match the attributes @word and/or @lemma case-insensitively. To accomplish that, I want to transform these two attributes into lower-case(@attr), but only in nodes that contain @cs="no".

What I tried so far

In PHP, I thought I'd be a smart guy and do something like this:

  1. Check to see if the (XPath) string contains @cs="no"
  2. If so, find all individual nodes with a regular expression

    preg_match_all("/(?<=node\[).*?(?=node\[|\])/", $xpath, $matches);

  3. Loop through all these matches (strings), and check if they contain @cs="no" again
  4. If so, remove that attribute, and replace the @word and @lemma attributes with their lower-case equivalents. Place the result in a dummy variable.

And now comes the tricky part:

  5. In the original XPath string, find the matched substring and replace it with the dummy variable.

You can see this in action here, but I also duplicated the PHP code below.

  <?php
  $xpath = '//node[@cat="top" and node[@rel="--" and @cat="smain" and node[@rel="su" and @pt="vnw" and @word="Dit" and @lemma="dit" and number(@begin) < ../node[@rel="hd" and @pt="ww" and @lemma="zijn"]/number(@begin)] and node[@rel="hd" and @pt="ww" and @lemma="zijn" and number(@begin) < ../node[@rel="predc" and @cat="np"]/node[@rel="det" and @pt="lid" and @word="een" and @cs="no" and @lemma="een"]/number(@begin)] and node[@rel="predc" and @cat="np" and node[@rel="det" and @pt="lid" and @word="een" and @cs="no" and @lemma="een" and number(@begin) < ../node[@rel="hd" and @pt="n" and @cs="no" and @lemma="zin"]/number(@begin)] and node[@rel="hd" and @pt="n" and @cs="no" and @lemma="zin" and number(@begin) < ../../../node[@rel="--" and @pt="let"]/number(@begin)]]] and node[@rel="--" and @pt="let"]]';
  $xpath = applyCs($xpath);

  var_dump($xpath);

  function applyCs($xpath) {
    if (strpos($xpath, '@cs="no"') !== false) {
      preg_match_all("/(?<=node\[).*?(?=node\[|\])/", $xpath, $matches);
      foreach ($matches as $match) {
        var_dump($match);
        if (strpos($match, '@cs="no"') !== false) {
          $dummyMatch = preg_replace('/(?:and )?@cs="no"/', '', $match);

            if (strpos($dummyMatch, '@word="') !== false) {
                $dummyMatch = str_replace('@word="', 'lower-case(@word)="', $dummyMatch);
            }
            if (strpos($dummyMatch, '@lemma="') !== false) {
                $dummyMatch = str_replace('@lemma="', 'lower-case(@lemma)="', $dummyMatch);
            }

            $xpath = str_replace($match, $dummyMatch, $xpath);
        }
      }
    }
    return $xpath;
  }

Problems with my function

First of all, the Ideone example linked above shows that the first node with a word attribute does not have the @cs="no" attribute, yet in the resulting XPath it still gets lower-case()'d.

Secondly, something that you may not see reproduced in the example: because I simply find-and-replace the old match with the new dummy, it is quite possible that I replace values in nodes of the original XPath that have no @cs attribute at all. I obviously do not want that.

Finally, I am not sure this is the best way. Efficiency is important to me, which is why I mostly avoid regular expressions and use strpos and str_replace as much as I can. However, if there is a way to "parse" XPath (similar to how you can parse XML in Perl with Twig, for instance) and manipulate it quickly, that is fine as well. In any case, a correct result matters more than a fast one.
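
A parse-like approach without extra modules could, for instance, track bracket depth instead of matching substrings. The sketch below is only an illustration under stated assumptions: the function names `rewritePredicate` and `applyCsByDepth` are hypothetical, and it assumes that no `[` or `]` occurs inside a quoted attribute value. It rewrites @word and @lemma only in the predicate that directly contains @cs="no", leaving parent and child nodes alone.

```php
<?php
// Hypothetical helper: glue the pieces of one predicate back together and
// rewrite its own (non-nested) text when that text contains @cs="no".
function rewritePredicate(array $pieces): string
{
    $ownText = '';
    foreach ($pieces as [$text, $isOwn]) {
        if ($isOwn) {
            $ownText .= $text;
        }
    }
    $caseInsensitive = strpos($ownText, '@cs="no"') !== false;

    $result = '';
    foreach ($pieces as [$text, $isOwn]) {
        if ($isOwn && $caseInsensitive) {
            // Remove the marker together with one neighbouring "and".
            $text = preg_replace('/ and @cs="no"|@cs="no"(?: and )?/', '', $text);
            $text = str_replace(
                ['@word="', '@lemma="'],
                ['lower-case(@word)="', 'lower-case(@lemma)="'],
                $text
            );
        }
        $result .= $text;
    }
    return $result;
}

// Hypothetical entry point.
// Assumption: no '[' or ']' appears inside quoted attribute values.
function applyCsByDepth(string $xpath): string
{
    // Split on brackets but keep them as separate tokens.
    $tokens = preg_split('/([\[\]])/', $xpath, -1, PREG_SPLIT_DELIM_CAPTURE);

    // Every frame collects [text, isOwnText] pairs for one open predicate.
    // Frame 0 holds the text outside any predicate.
    $stack = [[]];

    foreach ($tokens as $token) {
        if ($token === '[') {
            $stack[] = [];                       // a predicate opens
        } elseif ($token === ']' && count($stack) > 1) {
            $pieces = array_pop($stack);         // a predicate closes
            $closed = '[' . rewritePredicate($pieces) . ']';
            $stack[count($stack) - 1][] = [$closed, false];
        } else {
            $stack[count($stack) - 1][] = [$token, true];
        }
    }

    // Unbalanced input: put unclosed predicates back unchanged.
    while (count($stack) > 1) {
        $pieces = array_pop($stack);
        $unclosed = '[' . implode('', array_column($pieces, 0));
        $stack[count($stack) - 1][] = [$unclosed, false];
    }

    return implode('', array_column($stack[0], 0));
}

// Shortened, made-up query: only the middle node carries @cs="no".
echo applyCsByDepth('//node[@word="Dit" and node[@word="een" and @cs="no"] and node[@lemma="zin"]]'), "\n";
// prints: //node[@word="Dit" and node[lower-case(@word)="een"] and node[@lemma="zin"]]
```

Because each closed predicate is handed back to its parent as an opaque piece, the check for @cs="no" never sees the text of nested nodes, which is the scoping the substring-based attempt lacks.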

Tl;dr: in an XPath string, how can I replace an attribute with another string if its sibling attribute is set (to a particular value), using PHP without additional modules?

Ideas

  • Find a regular expression that can match each node without leaving out any gaps, and after editing a match where necessary simply glue all of them back together
  • Use PREG_OFFSET_CAPTURE to find the index of the match in the input XPath, and then one way or another replace the first hit you get from that index.
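
The second idea could be sketched roughly as follows. This is a minimal illustration, not a tested solution for every query: the function name `applyCsByOffset` is hypothetical, the regular expression is the one from the attempt above, and walking the matches from right to left is one possible choice for keeping the captured offsets valid while the string changes length.

```php
<?php
// Hypothetical helper: same idea as applyCs(), but each match is replaced
// at the exact byte offset where it was found.
function applyCsByOffset(string $xpath): string
{
    if (strpos($xpath, '@cs="no"') === false) {
        return $xpath;
    }

    // With PREG_OFFSET_CAPTURE every match is a pair [matchedText, byteOffset].
    preg_match_all(
        '/(?<=node\[).*?(?=node\[|\])/',
        $xpath,
        $matches,
        PREG_OFFSET_CAPTURE
    );

    // Go from right to left so that a replacement never shifts the offsets
    // of the matches that still have to be handled.
    foreach (array_reverse($matches[0]) as [$text, $offset]) {
        if (strpos($text, '@cs="no"') === false) {
            continue;
        }
        $edited = preg_replace('/(?: and )?@cs="no"/', '', $text);
        $edited = str_replace(
            ['@word="', '@lemma="'],
            ['lower-case(@word)="', 'lower-case(@lemma)="'],
            $edited
        );
        // Replace exactly the matched bytes, nothing else.
        $xpath = substr_replace($xpath, $edited, $offset, strlen($text));
    }

    return $xpath;
}

// Shortened, made-up query: only the "det" node carries @cs="no".
echo applyCsByOffset('//node[@cat="np" and node[@rel="det" and @word="een" and @cs="no"] and node[@rel="hd" and @word="Zin"]]'), "\n";
// prints: //node[@cat="np" and node[@rel="det" and lower-case(@word)="een"] and node[@rel="hd" and @word="Zin"]]
```

Since the replacement is tied to an offset, an identical substring elsewhere in the query cannot be touched by accident. The sketch still inherits the limits of the regular expression itself.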

2 Answers

Got it.

First of all, there was a silly mistake in the loop: I should have iterated over $matches[0] instead of $matches. The magic lies in the replacement, though. Instead of a plain string replace I am now using preg_replace (which doesn't really make me happy), because its optional limit argument lets me restrict the replacement to a single occurrence. Because the matches array is built from left to right, I can also assume that the replacements will happen in the correct order. The final code looks like this:

  function applyCs($xpath) {
    // Nothing to do when no node asks for case insensitivity.
    if (strpos($xpath, '@cs="no"') !== false) {
      // Collect the attribute text that follows each "node[".
      preg_match_all("/(?<=node\[).*?(?=node\[|\])/", $xpath, $matches);
      // $matches[0] holds the full matches, in left-to-right order.
      foreach ($matches[0] as $match) {
        if (strpos($match, '@cs="no"') !== false) {
          // Drop the marker attribute itself.
          $dummyMatch = preg_replace('/(?: and )?@cs="no"/', '', $match);

          if (strpos($dummyMatch, '@word="') !== false) {
            $dummyMatch = str_replace('@word="', 'lower-case(@word)="', $dummyMatch);
          }
          if (strpos($dummyMatch, '@lemma="') !== false) {
            $dummyMatch = str_replace('@lemma="', 'lower-case(@lemma)="', $dummyMatch);
          }

          // Replace only the first remaining occurrence (limit = 1).
          $xpath = preg_replace('/'.preg_quote($match, '/').'/', $dummyMatch, $xpath, 1);
        }
      }
    }
    return $xpath;
  }
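
To illustrate what the limit argument changes, here is a minimal sketch on a shortened, made-up fragment (the variable names are for illustration only). `str_replace` rewrites every occurrence of the match, while `preg_replace` with a limit of 1 touches only the first one it finds.

```php
<?php
// Made-up fragment in which the same substring occurs twice.
$xpath       = 'node[@word="een" and @cs="no"] and node[@word="een" and @cs="no"]';
$match       = '@word="een" and @cs="no"';
$replacement = 'lower-case(@word)="een"';

// Plain string replace: both occurrences change at once.
$everyHit = str_replace($match, $replacement, $xpath);

// preg_quote() escapes regex metacharacters in the literal match;
// the fourth argument limits the number of replacements to one.
$firstHit = preg_replace('/' . preg_quote($match, '/') . '/', $replacement, $xpath, 1);

echo $everyHit, "\n";
// prints: node[lower-case(@word)="een"] and node[lower-case(@word)="een"]
echo $firstHit, "\n";
// prints: node[lower-case(@word)="een"] and node[@word="een" and @cs="no"]
```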

I am leaving this question open for a while to look for more performant solutions.

This solution works with your current XPath and should work with similar queries later on. I'm not sure whether there are failure cases, however.

The idea is to extract the node declarations and then do a find/replace in each one that contains @cs="no".

Live demo

echo preg_replace_callback('~node\[(?:[^[]+(?=\]|node))~', function($match) {
    if (strpos($match[0], '@cs="no"') !== false) {
        return preg_replace(
            ['/@(lemma|word)/', '/\s*and\s*@cs="no"/'],
            ['lower-case(@$1)', ''],
            $match[0]
        );
    }
    return $match[0];
}, $xpathStr);
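
As a usage sketch, the callback above can be wrapped in a function and run on a shortened query. The function name `lowerCaseCsNodes` and the sample input are assumptions for illustration only; the patterns are the ones shown above.

```php
<?php
// Hypothetical wrapper around the callback shown above.
function lowerCaseCsNodes(string $xpathStr): string
{
    return preg_replace_callback('~node\[(?:[^[]+(?=\]|node))~', function ($match) {
        // Only rewrite node declarations that carry the marker attribute.
        if (strpos($match[0], '@cs="no"') !== false) {
            return preg_replace(
                ['/@(lemma|word)/', '/\s*and\s*@cs="no"/'],
                ['lower-case(@$1)', ''],
                $match[0]
            );
        }
        return $match[0];
    }, $xpathStr);
}

// Shortened sample query (not the one from the question).
$sample = '//node[@cat="np" and node[@rel="det" and @word="een" and @cs="no"] and node[@rel="hd" and @word="Zin"]]';
echo lowerCaseCsNodes($sample), "\n";
// prints: //node[@cat="np" and node[@rel="det" and lower-case(@word)="een"] and node[@rel="hd" and @word="Zin"]]
```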