I am trying to scrape a Wikipedia page with plain PHP and have been using $xpath->query to search the DOM. I want to select the node that has the text Known for on the Wikipedia page https://en.wikipedia.org/wiki/Ajmal_Kasab. The text is in the right-hand table, before the text 2008 Mumbai attacks. I loaded the page with DOMDocument::loadHTML and did the following:

var_dump( $value->saveHTML($xpath->query( "//table[@class[contains(.,'infobox')]]//tr[th='Known for']/th/text()" )[0])  ); 

I tried Known\x20for, Known&nbsp;for, Known&#160;for, etc., but none of them worked. Fortunately I stumbled upon the post Using XPATH to search text containing &nbsp; and tried manually pressing Alt + 0160 on my Windows 10 PC in the Sublime Text 3 editor. The expression then looks like Known<0xa0>for, and it worked.

Question 1: why in the world won't XPath accept a normal space or the literal &#160;? The Wikipedia page source has it as Known&#160;for. What if I had Linux or a different text editor? I am currently working locally; would it work on my Linux-based server as well? What is the computer science behind this?

Secondly, I need to convert the XPath result, which contains such spaces, into a PHP variable that stores <0xa0>. I have:

$tmp = $xpath->query("//table[@class[contains(.,'infobox')]]//tr[th='Known<0xa0>for']/th/text()");
$tmp = $domDomoc->saveHTML($tmp[0]);
$result = $xpath->query("//table[@class[contains(.,'infobox')]]//tr[th='{$tmp}']/td/text()");

It seems the variable $tmp doesn't hold on to <0xa0>, so $result is incorrect (false).

The whole PHP code is more complex and there are many words to search for, so I have boiled the code down to a simpler task. Words like Known for are dynamic and are fed into a function.

user31782
  • ` ` is not the same as the [Unicode non-breaking space](https://en.wikipedia.org/wiki/Non-breaking_space), and the latter is not the same as a traditional space. You could try some of the [text normalization techniques](https://stackoverflow.com/a/24368657/231316) or possibly [RegEx](https://stackoverflow.com/a/394032/231316) – Chris Haas Dec 18 '21 at 13:52

2 Answers

You claim "The Wikipedia page source has it as Known&#160for", which is not true at all; it has Known&#160;for. Secondly, you call &#160 a literal. Even if you meant &#160;, that is not a literal; it is an HTML numeric character reference, i.e. an escaping mechanism HTML provides so that you don't have to use the literal character. Of course your XPath doesn't work on the HTML source code: you have fed your string to the loadHTML method, which uses an HTML parser to parse the HTML source string, so the resulting DOM tree certainly doesn't have any representation of the form &#160; or &nbsp;. It just has a text node with Unicode characters, one of them being the character with decimal code point 160, or hexadecimal U+00A0.

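Neither XPath nor PHP lets you write that character in a PHP string literal (https://www.php.net/manual/en/language.types.string.php) as <0xa0>. In a double-quoted string it should be \u{00A0} (PHP 7 or later) or the UTF-8 byte sequence \xC2\xA0. A lone \xA0 is a single byte that is not valid UTF-8, which is the encoding the DOM extension uses for strings.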
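To make the point about parsing concrete, here is a minimal sketch. It assumes a tiny inline table (hypothetical markup standing in for the Wikipedia infobox) rather than the live page, and it only uses DOMDocument and DOMXPath. It shows that the parsed text node contains the bytes C2 A0 where the source had &#160;, and which spellings of the label match.

```php
<?php
// Hypothetical inline markup standing in for the Wikipedia infobox row.
$html = '<table class="infobox"><tr><th>Known&#160;for</th>'
      . '<td>2008 Mumbai attacks</td></tr></table>';

libxml_use_internal_errors(true); // keep parser warnings quiet
$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);

// After parsing, the character reference is gone: the text node holds U+00A0,
// which the DOM extension hands back as the UTF-8 bytes C2 A0.
$label    = $xpath->evaluate('string(//th)');
$labelHex = bin2hex($label);

// Count matching rows for three spellings of the label.
$matchesSpace = $xpath->query("//tr[th = 'Known for']")->length;        // ordinary space
$matchesNbsp  = $xpath->query("//tr[th = 'Known\u{00A0}for']")->length; // U+00A0, PHP 7+
$matchesBytes = $xpath->query("//tr[th = 'Known\xC2\xA0for']")->length; // same bytes, spelled out

printf("%s %d %d %d\n", $labelHex, $matchesSpace, $matchesNbsp, $matchesBytes);
```

Because the escape is resolved by PHP before the expression reaches XPath, the result should not depend on the editor or operating system, as long as the string really contains U+00A0 encoded as UTF-8.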

For the second part of the question, what kind of value do you expect to get from $xpath->query("//table[@class[contains(.,'infobox')]]//tr[th='Known<0xa0>for']/th/text()")? A DOM node list? What do you expect to achieve by putting that variable into another PHP string literal in the $xpath->query("//table[@class[contains(.,'infobox')]]//tr[th='{$tmp}']/td/text()")?

If you want a PHP string from an XPath evaluation, use an expression that returns a string rather than nodes (string(//th) returns the string value of the first th element) and call the evaluate method, not the query method, e.g.

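libxml_use_internal_errors(true); // the page's HTML5 markup triggers parser warnings
$doc = new DOMDocument();
$doc->loadHTML(file_get_contents('https://en.wikipedia.org/wiki/Ajmal_Kasab'));
$xpath = new DOMXPath($doc);
// evaluate() with string() returns a PHP string instead of a node list;
// \u{00A0} (PHP 7+) puts the non-breaking space into the expression.
$value = $xpath->evaluate("string(//tr[th = 'Known\u{00A0}for']/td)");
echo $value;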
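Since the question says the labels are dynamic and fed into a function, the same idea can be wrapped in a helper. This is only a sketch: the function name infoboxValue and the sample markup are hypothetical, and mapping U+00A0 to a plain space with the XPath 1.0 functions translate() and normalize-space() is one possible normalization, not something the original code does.

```php
<?php
// Hypothetical helper: returns the text of the td that follows a given th label.
function infoboxValue(DOMXPath $xpath, string $label): string
{
    // XPath 1.0 string literals cannot escape quotes, so this sketch rejects them.
    if (strpbrk($label, "'\"") !== false) {
        throw new InvalidArgumentException('Label must not contain quotes');
    }
    $nbsp  = "\u{00A0}";
    // Map U+00A0 to a plain space on both sides, so callers can type either one.
    $label = str_replace($nbsp, ' ', $label);
    $expr  = "string(//table[contains(@class, 'infobox')]"
           . "//tr[translate(normalize-space(th), '{$nbsp}', ' ') = '{$label}']/td)";
    return trim($xpath->evaluate($expr));
}

// Hypothetical markup; the second row is a placeholder, not real page data.
$html = '<table class="infobox vcard">'
      . '<tr><th>Known&#160;for</th><td><a href="#">2008 Mumbai attacks</a></td></tr>'
      . '<tr><th>Born</th><td>placeholder</td></tr>'
      . '</table>';

libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);

$known = infoboxValue($xpath, 'Known for'); // typed with an ordinary space
$born  = infoboxValue($xpath, 'Born');
echo $known, "\n", $born, "\n";
```

Taking the string value of the td also covers the case where the cell contains a link rather than a direct text node.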
Martin Honnen
  • #1 `&#160` was a typo, I meant `&#160;`. #2 Typing `\xA0` in the query expression did not work; `\xc2\xa0` worked instead (maybe because my PHP files are saved as UTF-8). #3 I fixed the code in my _question-2_. #4..[to be cont...] – user31782 Dec 19 '21 at 10:42
  • #4 Here's my understanding: `$xpath = new DomXPath(" "); $xpath->query("\xc2\xa0")` first grabs the HTML after it has been parsed and thus converted into a string of UTF-8 characters. Then `query` tries to match that parsed output against my expression. So my expression should be the same as what a browser would render, i.e. a UTF-8 character string. – user31782 Dec 19 '21 at 10:53
  • It is still not clear what you want to achieve, I guess you want to identify the `tr` which has the `th` with `Known for` and then access the sibling `td` cell. But that has no text node child, rather it contains a link. So it seems using a single expression to select the `//tr[th='Known\u{00A0}for']/td` suffices to give you the `td`. – Martin Honnen Dec 19 '21 at 10:57
  • I achieved the second part in a different way. I have a function which takes `Known\xc2\xa0for`, `Born`, `Died`, etc. as arguments and returns their corresponding `td` values. I simply used `$intro["td"] = $intro["th"][0]->nextSibling;`. Initially (and in my question) I was using `$domDomoc->saveHTML($tmp[0]);`, but it seems that `saveHTML` converts `\xc2\xa0` into a normal space character. – user31782 Dec 19 '21 at 11:10
XPath is designed to be hosted in other programming languages (PHP in your case) and rather than having an escaping convention of its own, it relies on the escaping conventions of the host language. So you enter a NBSP (xa0) character in the XPath expression the same way as you would enter it in any other PHP string literal, for example \u{00A0} (PHP 7 or later) or, since PHP strings are byte strings and the DOM works in UTF-8, the byte sequence \xC2\xA0 in a double-quoted string.

&#xa0; would be appropriate when XPath is hosted in XML, or &nbsp; when it is hosted in HTML, but not when it is hosted in PHP.

You ask "what is the computer science behind this?". Basically, it's to avoid the double-escaping problem. When a sublanguage such as regex has an escape convention (e.g. \\ to represent \) and is then hosted in another language with a similar escape convention, you end up having to write \ as \\\\ (or & as &amp;amp;). Since XPath was designed explicitly for hosting within other languages, they decided to use the host-language escaping capabilities rather than superimpose their own.
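A small sketch of what "host-language escaping" means in PHP may help here. It uses only core functions (strlen, preg_match) and shows that PHP resolves the escape before XPath ever sees the string; the variable names are just for illustration.

```php
<?php
$double  = "Known\u{00A0}for"; // double quotes: PHP turns the escape into U+00A0
$bytes   = "Known\xC2\xA0for"; // the same string, spelled as raw UTF-8 bytes
$single  = 'Known\u{00A0}for'; // single quotes: no escape processing at all
$oneByte = "Known\xA0for";     // a lone 0xA0 byte (NBSP in Latin-1), not UTF-8

$sameString   = ($double === $bytes);               // both spellings give identical bytes
$doubleLength = strlen($double);                    // 5 + 2 + 3 bytes
$singleLength = strlen($single);                    // the backslash sequence stays as 8 literal characters
$validUtf8    = preg_match('//u', $oneByte) === 1;  // the /u modifier rejects invalid UTF-8

var_dump($sameString, $doubleLength, $singleLength, $validUtf8);
```

The last line is consistent with the comment above that `\xA0` did not work while `\xc2\xa0` did: the single byte is not a valid UTF-8 string, so it cannot equal the U+00A0 character in the parsed document.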

Michael Kay
  • _So you enter a NBSP (xa0) character in the XPath expression the same way as you would enter it in any other PHP string literal_. Ok, I can write NBSP in php simply by pressing space, i.e. ` `, but this is not working in the xpath expression. – user31782 Dec 18 '21 at 17:42
  • Pressing the space key surely gives you an ordinary space, not a non-breaking space. – Michael Kay Dec 19 '21 at 12:47