I am handling some XPath strings that I want to modify before actually using them. For those of you who are not familiar with XPath: in short, it is a way of addressing parts of an XML structure, and it is often used as formal input for an XPath/XQuery-based search engine.
The goal
To see an expanded/prettified version of the XPath snippets below, I can direct you to the following beautifier. Disclaimer: I am the author of that tool.
My XPath strings can be quite simple
//node[@cat="smain" and node[@rel="su" and @pt="vnw"] and node[@rel="hd" and @pt="ww"] and node[@rel="predc" and @cat="np" and node[@rel="det" and @pt="lid"] and node[@rel="hd" and @pt="n"]]]
but also very elaborate
//node[@cat="top" and node[@rel="--" and @cat="smain" and node[@rel="su" and @pt="vnw" and @word="Dit" and @lemma="dit" and number(@begin) < ../node[@rel="hd" and @pt="ww" and @lemma="zijn"]/number(@begin)] and node[@rel="hd" and @pt="ww" and @lemma="zijn" and number(@begin) < ../node[@rel="predc" and @cat="np"]/node[@rel="det" and @pt="lid" and @word="een" and @cs="no" and @lemma="een"]/number(@begin)] and node[@rel="predc" and @cat="np" and node[@rel="det" and @pt="lid" and @word="een" and @cs="no" and @lemma="een" and number(@begin) < ../node[@rel="hd" and @pt="n" and @cs="no" and @lemma="zin"]/number(@begin)] and node[@rel="hd" and @pt="n" and @cs="no" and @lemma="zin" and number(@begin) < ../../../node[@rel="--" and @pt="let"]/number(@begin)]]] and node[@rel="--" and @pt="let"]]
As you may have noticed, node is the only element name that is used; only the attributes differ. The attribute that I am interested in is @cs="no", which means that a future search request should treat the attributes @word and/or @lemma of that node case-insensitively. To accomplish that, I want to transform these two attributes into lower-case(@attr). The thing is that I only want that for nodes that contain @cs="no".
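To make the intended rewrite concrete, here is a minimal sketch for a single, flat predicate (the text between one pair of brackets, with no nested node[...] inside). The helper name lowerCaseFlatPredicate is hypothetical and only illustrates the desired before/after; it does not deal with nesting.

```php
<?php
// Hypothetical helper: rewrites ONE flat predicate body (no nested node[...]).
function lowerCaseFlatPredicate(string $predicate): string
{
    if (strpos($predicate, '@cs="no"') === false) {
        return $predicate; // no marker: leave the predicate untouched
    }
    // Drop the marker together with the " and " on one side of it.
    $predicate = preg_replace('/ and @cs="no"|@cs="no" and |@cs="no"/', '', $predicate);

    // Wrap the two attributes of interest in lower-case().
    return str_replace(
        ['@word="', '@lemma="'],
        ['lower-case(@word)="', 'lower-case(@lemma)="'],
        $predicate
    );
}

echo lowerCaseFlatPredicate('@rel="det" and @pt="lid" and @word="een" and @cs="no" and @lemma="een"'), "\n";
// prints: @rel="det" and @pt="lid" and lower-case(@word)="een" and lower-case(@lemma)="een"
```

A predicate without @cs="no" comes back unchanged, which is the behaviour I want for every other node.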
What I tried so far
In PHP, I thought I'd be a smart guy and do something like this:
- Check to see if the (XPath) string contains @cs="no".
- If so, find all individual nodes with a regular expression: preg_match_all("/(?<=node\[).*?(?=node\[|\])/", $xpath, $matches);
- Loop through all these matches (strings), and check whether they contain @cs="no" again.
- If so, remove that attribute, and replace the @word and @lemma attributes with their lower-case equivalent. Place the result in a dummy variable.
And now comes the tricky part:
- In the original XPath string, find and replace the matched substring by the dummy variable.
You can see this in action here, but I also duplicated the PHP code below.
<?php
$xpath = '//node[@cat="top" and node[@rel="--" and @cat="smain" and node[@rel="su" and @pt="vnw" and @word="Dit" and @lemma="dit" and number(@begin) < ../node[@rel="hd" and @pt="ww" and @lemma="zijn"]/number(@begin)] and node[@rel="hd" and @pt="ww" and @lemma="zijn" and number(@begin) < ../node[@rel="predc" and @cat="np"]/node[@rel="det" and @pt="lid" and @word="een" and @cs="no" and @lemma="een"]/number(@begin)] and node[@rel="predc" and @cat="np" and node[@rel="det" and @pt="lid" and @word="een" and @cs="no" and @lemma="een" and number(@begin) < ../node[@rel="hd" and @pt="n" and @cs="no" and @lemma="zin"]/number(@begin)] and node[@rel="hd" and @pt="n" and @cs="no" and @lemma="zin" and number(@begin) < ../../../node[@rel="--" and @pt="let"]/number(@begin)]]] and node[@rel="--" and @pt="let"]]';
$xpath = applyCs($xpath);
var_dump($xpath);
function applyCs($xpath) {
    // Only do any work when the marker occurs somewhere in the string
    if (strpos($xpath, '@cs="no"') !== false) {
        // Collect the text after each "node[" up to the next "node[" or "]"
        preg_match_all("/(?<=node\[).*?(?=node\[|\])/", $xpath, $matches);
        foreach ($matches as $match) {
            var_dump($match);
            if (strpos($match, '@cs="no"') !== false) {
                // Remove the marker, then wrap @word and @lemma in lower-case()
                $dummyMatch = preg_replace('/(?:and )?@cs="no"/', '', $match);
                if (strpos($dummyMatch, '@word="') !== false) {
                    $dummyMatch = str_replace('@word="', 'lower-case(@word)="', $dummyMatch);
                }
                if (strpos($dummyMatch, '@lemma="') !== false) {
                    $dummyMatch = str_replace('@lemma="', 'lower-case(@lemma)="', $dummyMatch);
                }
                // Put the rewritten piece back by searching for the old text
                $xpath = str_replace($match, $dummyMatch, $xpath);
            }
        }
    }
    return $xpath;
}
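To see what the regular expression above actually captures, the following minimal sketch runs it on a shortened, made-up XPath. Note that preg_match_all by default fills $matches with one sub-array per pattern group, so the full matches live in $matches[0]; this may be relevant to the foreach in applyCs, which iterates over $matches directly.

```php
<?php
// Shortened example input, invented for illustration only.
$xpath = '//node[@cat="np" and node[@rel="det"] and node[@rel="hd"]]';

// Same pattern as in applyCs (single-quoted here; the regex is identical).
preg_match_all('/(?<=node\[).*?(?=node\[|\])/', $xpath, $matches);

// $matches is an array of arrays; index 0 holds the full-pattern matches.
print_r($matches[0]);
// $matches[0] contains three strings:
//   '@cat="np" and '   (cut off where the first nested node[ begins)
//   '@rel="det"'
//   '@rel="hd"'
```

Each captured piece stops at the next node[ or ], so anything in a predicate that comes after a nested node is not part of any match.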
Problems with my function
First of all, you will see in the Ideone example provided via the link above that the first node with a @word attribute does not have the @cs="no" attribute, yet in the resulting XPath it does get lower-cased. Secondly, something that you may not see reproduced in the example: because I simply find-and-replace the old match with the new dummy, it is very well possible that I replace values in other nodes of the original XPath that have no @cs attribute. I obviously do not want that. And finally, I am not sure this is the best way. Efficiency is important to me, and I mostly avoid regular expressions because of it; that is why I am using strpos and str_replace as much as I can. However, if there is a way to "parse" XPath (similarly to how you can parse XML in Perl with Twig, for instance) and manipulate it accordingly in a fast way, that is good as well. Correctness comes before efficiency, though.
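One lightweight way to "parse" without an extra module is to split the string on brackets and track nesting depth, so that every piece of text is attributed to the predicate that directly contains it. The sketch below is only an illustration under assumptions: the function name applyCsByDepth is hypothetical, and it assumes that no [ or ] occurs inside a quoted attribute value.

```php
<?php
// Hypothetical sketch: rewrite only the predicates whose own text has @cs="no".
// Assumption: quoted attribute values never contain "[" or "]".
function applyCsByDepth(string $xpath): string
{
    // Split on brackets but keep them, so implode() can rebuild the string.
    $tokens = preg_split('/([\[\]])/', $xpath, -1, PREG_SPLIT_DELIM_CAPTURE);

    $owner  = [];   // token index => id of the predicate directly containing it
    $hasCs  = [];   // predicate id => true if its own text contains @cs="no"
    $stack  = [0];  // id 0 stands for text outside any predicate
    $nextId = 1;

    foreach ($tokens as $i => $token) {
        if ($token === '[') {
            $stack[] = $nextId++;          // a new predicate opens
        } elseif ($token === ']') {
            if (count($stack) > 1) {
                array_pop($stack);         // the current predicate closes
            }
        } else {
            $id = end($stack);
            $owner[$i] = $id;
            if (strpos($token, '@cs="no"') !== false) {
                $hasCs[$id] = true;
            }
        }
    }

    foreach ($owner as $i => $id) {
        if (empty($hasCs[$id])) {
            continue;                      // predicate without the marker
        }
        $text = preg_replace('/ and @cs="no"|@cs="no" and |@cs="no"/', '', $tokens[$i]);
        $tokens[$i] = str_replace(
            ['@word="', '@lemma="'],
            ['lower-case(@word)="', 'lower-case(@lemma)="'],
            $text
        );
    }

    return implode('', $tokens);
}

$in = '//node[@word="Dit" and node[@word="Een" and @cs="no" and @lemma="een"] and node[@word="Een"]]';
echo applyCsByDepth($in), "\n";
// prints: //node[@word="Dit" and node[lower-case(@word)="Een" and lower-case(@lemma)="een"] and node[@word="Een"]]
```

Because pieces are rewritten by position rather than by searching for their text, the second node with the identical @word="Een" stays untouched.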
Tl;dr: in an XPath string, how can I replace an attribute by another string if its sibling attribute is set (to a particular value), using PHP without additional modules?
Ideas
- Find a regular expression that can match each node without leaving any gaps, and after editing a match where necessary, simply glue all of them back together
- Use PREG_OFFSET_CAPTURE to find the index of the match in the input XPath, and then one way or another replace the first hit you get from that index.
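The PREG_OFFSET_CAPTURE idea could be sketched roughly as follows. This is a hedged illustration, not a finished solution: the function name applyCsByOffset is hypothetical, and it reuses the same regular expression as above, so text that follows a nested node inside a predicate still falls into a gap and is never examined.

```php
<?php
// Hypothetical sketch: replace each match at its own offset instead of
// searching for its text, so identical text elsewhere is not affected.
function applyCsByOffset(string $xpath): string
{
    preg_match_all(
        '/(?<=node\[).*?(?=node\[|\])/',
        $xpath,
        $matches,
        PREG_OFFSET_CAPTURE
    );

    // Each entry of $matches[0] is [matched text, byte offset].
    // Work from the last match to the first so earlier offsets stay valid.
    foreach (array_reverse($matches[0]) as [$text, $offset]) {
        if (strpos($text, '@cs="no"') === false) {
            continue;
        }
        $new = preg_replace('/ and @cs="no"|@cs="no" and |@cs="no"/', '', $text);
        $new = str_replace(
            ['@word="', '@lemma="'],
            ['lower-case(@word)="', 'lower-case(@lemma)="'],
            $new
        );
        // Overwrite exactly the matched span.
        $xpath = substr_replace($xpath, $new, $offset, strlen($text));
    }

    return $xpath;
}

$in = '//node[@word="Dit" and node[@word="Een" and @cs="no" and @lemma="een"] and node[@word="Een"]]';
echo applyCsByOffset($in), "\n";
// prints: //node[@word="Dit" and node[lower-case(@word)="Een" and lower-case(@lemma)="een"] and node[@word="Een"]]
```

Going through the matches in reverse matters because each replacement changes the string length, which would shift the offsets of any later match.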