
I have the following function that finds values within an HTML DOM.

It works, but when I pass a $value such as: Levi's Baby Overall, it breaks, because the , and ' characters are not escaped.

How can I escape all invalid characters in a DOMXPath query?

private function extract($file,$url,$value) {
    $result = array();
    $i = 0;
    $dom = new DOMDocument();
    @$dom->loadHTMLFile($file);
    //use DOMXpath to navigate the html with the DOM
    $dom_xpath = new DOMXpath($dom);
    $elements = $dom_xpath->query("//*[text()[contains(., '" . $value . "')]]");
    if (!is_null($elements)) {
        foreach ($elements as $element) {
            $nodes = $element->childNodes;
            foreach ($nodes as $node) {
                if (($node->nodeValue != null) && ($node->nodeValue === $value)) {
                    $xpath = preg_replace("/\/text\(\)/", "", $node->getNodePath());
                    $result[$i]['url'] = $url;
                    $result[$i]['value'] = $node->nodeValue;
                    $result[$i]['xpath'] = $xpath;
                    $i++;
                }
            }
        }
    }
    return $result;
}
choroba
Ionut Flavius Pogacian

2 Answers


One shouldn't substitute placeholders in an XPath expression with arbitrary, user-provided strings -- because of the risk of (malicious) XPath injection.
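To make the risk concrete, here is a minimal sketch of what happens when a value is pasted straight into the expression. The two-item document and the helper name naive_query are made up for illustration; only the concatenation pattern comes from the question.

```php
<?php
// Hypothetical two-item document, used only for illustration.
$dom = new DOMDocument();
$dom->loadXML('<root><item>Baby Overall</item><item>Jeans</item></root>');
$xpath = new DOMXPath($dom);

// Naive query builder (hypothetical name): the value is pasted between
// single quotes, the same pattern the question uses.
function naive_query(string $value): string {
    return "//item[contains(., '" . $value . "')]";
}

// A harmless value matches only the item that contains it.
$safe = $xpath->query(naive_query('Overall'));

// A crafted value closes the string literal and appends its own condition.
// The expression becomes: //item[contains(., 'x') or contains(., '')]
// contains(., '') is true for every node, so every item matches.
$injected = $xpath->query(naive_query("x') or contains(., '"));

// A plain apostrophe leaves the expression malformed, so query() returns
// false. The @ only hides the warning that goes with it.
$broken = @$xpath->query(naive_query("Levi's"));
```

So the same bug shows up in two ways: an innocent apostrophe breaks the query, and a deliberate one rewrites it.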

To deal safely with such unknown strings, the solution is to use a pre-compiled XPath expression and to pass the user-provided string as a variable to it. This also completely eliminates the need to deal with nested quotes in the code.
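Where the XPath API offers no way to bind variables (a comment below says this is the case for PHP's DOMXPath), one way to get the same effect is to keep the expression fixed and do the comparison in the host language. The following is only a sketch under that assumption; find_text_nodes is a hypothetical helper name and the sample document is made up.

```php
<?php
// Hypothetical helper: find text nodes containing $value without ever
// placing $value inside the XPath expression.
function find_text_nodes(DOMDocument $dom, string $value): array {
    $xpath = new DOMXPath($dom);
    $matches = [];
    // The expression is a fixed string; the comparison happens in PHP,
    // so quotes in $value have no special meaning.
    foreach ($xpath->query('//text()') as $textNode) {
        if (strpos($textNode->nodeValue, $value) !== false) {
            $matches[] = $textNode;
        }
    }
    return $matches;
}

// Made-up sample document containing both kinds of quote.
$dom = new DOMDocument();
$dom->loadXML('<root><p>Levi\'s Baby Overall</p><p>Plain "quoted" text</p></root>');
$found = find_text_nodes($dom, "Levi's Baby Overall");
```

The trade-off is that every text node is visited in PHP, which may be slower than letting the XPath engine filter on a large document.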

Dimitre Novatchev
  • the whole reason of ESCAPING the string in the first place, is to make sure it has no special meaning in xpath, both so that hackers can't inject anything to it, and that it can reliably search for exact text strings like `//parent::*[@password]` – hanshenrik Jul 21 '17 at 02:22
  • `the solution is to use a pre-compiled XPath expression and to pass the user-provided string as a variable to it.` - uhh, no, that's not a solution in PHP, as PHP's DOMXPath doesn't support variables (just another reason to avoid PHP, i guess) – hanshenrik Jul 21 '17 at 02:56
  • @hanshenrik, Yes, I don't know PHP, in .NET one can use the XsltContext class: https://msdn.microsoft.com/en-us/library/system.xml.xsl.xsltcontext(v=vs.110).aspx – Dimitre Novatchev Jul 21 '17 at 03:32

PHP has no built-in function for escaping/quoting strings for XPath queries. Furthermore, escaping strings for XPath is surprisingly difficult to do; see https://stackoverflow.com/a/1352556/1067003 for more information on why. Here is a PHP port of the C# XPath quote function from that answer:

function xpath_quote(string $value):string{
    if(false===strpos($value,'"')){
        return '"'.$value.'"';
    }
    if(false===strpos($value,'\'')){
        return '\''.$value.'\'';
    }
    // if the value contains both single and double quotes, construct an
    // expression that concatenates all non-double-quote substrings with
    // the quotes, e.g.:
    //
    //    concat("'foo'", '"', "bar")
    $sb='concat(';
    $substrings=explode('"',$value);
    for($i=0;$i<count($substrings);++$i){
        $needComma=($i>0);
        if($substrings[$i]!==''){
            if($i>0){
                $sb.=', ';
            }
            $sb.='"'.$substrings[$i].'"';
            $needComma=true;
        }
        if($i < (count($substrings) -1)){
            if($needComma){
                $sb.=', ';
            }
            $sb.="'\"'";
        }
    }
    $sb.=')';
    return $sb;
}

Example usage:

$elements = $dom_xpath->query("//*[contains(text()," . xpath_quote($value) . ")]");
  • Notice that I did not add the quoting characters (") in the XPath itself, because the xpath_quote function adds them for me (or builds the concat() equivalent if needed).
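
For a self-contained way to try the quoting idea end to end, here is a rough sketch. The function xpath_literal is a hypothetical compact variant written for this example, not the port above, and the sample document is made up. When both quote types are present it may emit empty "" pieces inside concat(), which is harmless.

```php
<?php
// Hypothetical compact variant of the quoting idea (not the port above).
function xpath_literal(string $value): string {
    if (strpos($value, "'") === false) {
        return "'" . $value . "'";
    }
    if (strpos($value, '"') === false) {
        return '"' . $value . '"';
    }
    // Both quote types present: wrap each double-quote-free piece in double
    // quotes and join the pieces with the single-quoted literal '"'.
    $pieces = array_map(function (string $piece): string {
        return '"' . $piece . '"';
    }, explode('"', $value));
    return 'concat(' . implode(", '\"', ", $pieces) . ')';
}

// Made-up sample document containing both kinds of quote.
$dom = new DOMDocument();
$dom->loadXML("<root><p>Levi's Baby Overall</p><p>5\" and 6' lengths</p></root>");
$xpath = new DOMXPath($dom);

foreach (["Levi's Baby Overall", "5\" and 6'"] as $value) {
    // The helper supplies its own quotes, so none appear in the template.
    $query = '//*[text()[contains(., ' . xpath_literal($value) . ')]]';
    $nodes = $xpath->query($query);
    echo $query, ' => ', $nodes->length, " match(es)\n";
}
```

Printing the generated query next to the match count makes it easy to see which of the three quoting forms was chosen for each value.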
hanshenrik