0

I have many XML files and need to search them for a string (more precisely, a not-too-complicated regex).

For each match I want the XPath of the node that contains the string, e.g.:

pattern = /home|house/

files: file1.xml, file2.xml etc

Results:

"home" in file1.xml, xpath: //root/cars/car[2]
"house" in file2.xml, xpath: //root[1]/elemA[2][@attribute1='first']

How can I achieve this? I can use PHP, Python, JavaScript, or a Vim plugin (I have already worked with all of those).

apelliciari

3 Answers

3

Search:

 //*[contains(text(), 'home') or contains(text(), 'house')]
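
Whichever form the search takes, it matters which text is tested: an element's own text, or its full string value including all descendants (which is what `.` yields in XPath). A small Python analogy using only the standard library shows the difference; the sample document and the helper names `own_text_hits` and `full_text_hits` are made up for illustration:

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document, not taken from the question.
doc = ET.fromstring("<root><cars><car>my home</car></cars></root>")


def own_text_hits(root, word):
    """Tags of elements whose own leading text contains word."""
    return [e.tag for e in root.iter() if word in (e.text or "")]


def full_text_hits(root, word):
    """Tags of elements whose text, descendants included, contains word."""
    return [e.tag for e in root.iter() if word in "".join(e.itertext())]


print(own_text_hits(doc, "home"))   # ['car']
print(full_text_hits(doc, "home"))  # ['root', 'cars', 'car']
```

Testing the full string value reports every ancestor of the matching node as well, which is usually not what you want when the goal is the path of the innermost node.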

In PHP:

Use DOMDocument & DOMXPath, and then just call DOMNode::getNodePath() on the resulting matches.

If you actually need a regex rather than the plain substring matches above: PHP's DOMXPath only supports XPath 1.0 functions, which do not include regex matching, but you can extend it with a user-defined function via DOMXPath::registerPhpFunctions().

Whipping up something quick without too much error handling:

// Callback for XPath: true if any text node in $nodelist matches $regex.
function xpathregexmatch($nodelist,$regex){
        foreach($nodelist as $node){
                if( $node instanceof DOMText && preg_match($regex,$node->nodeValue)) return true;
        }
        return false;
}

foreach(glob('*.xml') as $file){
        $d = new DOMDocument();
        $d->load($file);
        $x = new DOMXPath($d);
        // Make the PHP callback available inside XPath under the php: prefix.
        $x->registerNamespace("php", "http://php.net/xpath");
        $x->registerPHPFunctions('xpathregexmatch');
        // Select every element that has a text node child matching the regex.
        $matches = $x->query('//*[php:function("xpathregexmatch",text(),"/house|home/")]');
        if($matches->length){
                foreach($matches as $node){
                        echo $file. ':'.$node->getNodePath().PHP_EOL;
                }
        }
}
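
Since the question also allows Python, here is a rough sketch of the same idea with only the standard library. `xml.etree.ElementTree` has no counterpart to getNodePath(), so the sketch builds a positional path while walking the tree. The helper names `iter_with_paths` and `find_matches` and the sample XML are assumptions for illustration, and namespaced tags would show up in ElementTree's `{uri}tag` form rather than with a prefix:

```python
import re
import xml.etree.ElementTree as ET


def iter_with_paths(elem, path):
    """Yield (element, path) for elem and all its descendants, depth first."""
    yield elem, path
    seen = {}  # per-tag counter so siblings get 1-based positions, as in XPath
    for child in elem:
        seen[child.tag] = seen.get(child.tag, 0) + 1
        child_path = f"{path}/{child.tag}[{seen[child.tag]}]"
        yield from iter_with_paths(child, child_path)


def find_matches(xml_text, pattern):
    """Return (matched string, path) pairs for elements whose own text matches."""
    root = ET.fromstring(xml_text)
    results = []
    for elem, path in iter_with_paths(root, f"/{root.tag}[1]"):
        hit = pattern.search(elem.text or "")
        if hit:
            results.append((hit.group(0), path))
    return results


# Hypothetical sample input, not one of the files from the question.
sample = (
    "<root><cars><car>a flat</car><car>my home</car></cars>"
    "<note>house rules</note></root>"
)
for word, path in find_matches(sample, re.compile(r"home|house")):
    print(f'"{word}" xpath: {path}')
```

To run this over real files, one option is to loop over `glob.glob("*.xml")` and pass each file's contents to `find_matches`. Only the text before an element's first child is checked here, so text that follows a child element would need extra handling.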
Wrikken
  • Thanks, I got it! Very helpful. I'll try it and let you know. – apelliciari Mar 06 '13 at 23:18
  • You could also use `functionString` and `preg_match()` directly instead of having the `xpathregexmatch()` function. – salathe Mar 06 '13 at 23:34
  • @salathe: good point, always forget about `functionString`... _But_: `$x->query('//*[php:functionString("preg_match","/wh[ab]t/",text())]');` or `$x->query('//*[php:functionString("preg_match","/wh[ab]t/",.)]'); ` also select all parent nodes because it then probably looks at `->textContent`... – Wrikken Mar 06 '13 at 23:44
  • @Wrikken no, it's because `preg_match()` returns 0 or 1 (or false); you'll want to check the return value inside the predicate if going down that route: `//*[php:functionString("preg_match", "/house|home/", text()) = "1"]` (or go wild and use `[boolean(number(php:functionString(…)))]`). – salathe Mar 06 '13 at 23:59
  • Ah. glad I ran into you today... Yes, less type-juggling it was... `//*[php:functionString("preg_match","/wh[ab]t/",text()) = "1"]` works then ;) – Wrikken Mar 07 '13 at 00:04
  • You don't need to check the length, as you `foreach` later on. However, I find the stringification of SimpleXML useful to give this a bit more of a cross-over: http://stackoverflow.com/a/15262125/367456 – hakre Mar 07 '13 at 03:02
  • @hakre: yeah, that's because `foreach`ing over empty arrays generates notices... so I check anything for content before a `foreach`, be it an actual array, or just iterable. – Wrikken Mar 07 '13 at 10:20
  • There is no notice for an empty array in foreach. Also, this one is an iterator, so it really should not raise any notice even if there is no match (only if the xpath fails, but then `$matches->length` won't work either). – hakre Mar 07 '13 at 10:26
2

In PHP: glob the XML files, select all nodes with XPath, run preg_match_all on each node's text and, if there are matches, get the node's XPath with getNodePath() and output it:

$pattern = '/home|house|guide/iu';

foreach (glob('data/*.xml') as $file)
{
    foreach (simplexml_load_file($file)->xpath('//*') as $node)
    {
        if (!preg_match_all($pattern, $node, $matches)) continue;

        printf(
            "\"%s\" in %s, xpath: %s\n", implode('", "', $matches[0]),
            basename($file), dom_import_simplexml($node)->getNodePath()
        );
    }
}

Example output:

"Guide" in iana-charsets-2013-03-05.xml, xpath: /*/*[7]/*[158]/*[4]
"Guide" in iana-charsets-2013-03-05.xml, xpath: /*/*[7]/*[224]/*[2]
"Guide" in iana-charsets-2013-03-05.xml, xpath: /*/*[7]/*[224]/*[4]
"guide" in rdf-dmoz.xml, xpath: /*/*[4]/d:Description
"guide" in rdf-dmoz.xml, xpath: /*/*[5]/d:Description

Nice question btw.

hakre
0

PHP SimpleXML:

$xml=simplexml_load_file("file1.xml"); // load from a file; simplexml_load_string() expects XML markup, not a filename
foreach ($xml->cars->car[2] as $car) {
    // do sth with $car
}

For more detail, please make your question more specific.

michi