I am trying to scrape a Wikipedia page with plain PHP and have been using $xpath->query() to search the DOM. I am trying to select the node that has the text Known for on this Wikipedia page (https://en.wikipedia.org/wiki/Ajmal_Kasab). The text is in the right-hand-side table, before the text 2008 Mumbai attacks. I loaded the page with DOMDocument::loadHTML() and did the following:
var_dump( $value->saveHTML($xpath->query( "//table[@class[contains(.,'infobox')]]//tr[th='Known for']/th/text()" )[0]) );
I tried Known\x20for, Known&nbsp;for and Known&#160;for etc., but none of them worked. Fortunately I stumbled upon the post "Using XPATH to search text containing &nbsp;" and tried typing the character manually by pressing Alt + 0160 on my Windows 10 PC in the Sublime Text 3 editor. The expression then looks like Known<0xa0>for, and it worked.
My first question is: why won't XPath accept a normal space or the literal &nbsp;? The Wikipedia page source has it as Known&nbsp;for. What if I were on Linux or used a different text editor? I am currently working locally; would it work on my Linux-based server as well? What is the computer science behind this?
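To make the space question concrete, here is a minimal sketch, assuming PHP 7 or later and a small hand-written table standing in for the Wikipedia infobox (the $html string and the variable names are illustrative assumptions, not the real page). It writes the no-break space as the escape "\u{00A0}" instead of typing it, so the source file stays plain ASCII and should not depend on the editor or the operating system.

```php
<?php
// Assumed stand-in for the infobox row; the real page is not fetched here.
$html = '<table class="infobox vcard"><tr><th>Known&#160;for</th>'
      . '<td>2008 Mumbai attacks</td></tr></table>';

$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);

// A normal space is U+0020. The parser decodes the entity to U+00A0,
// which is the two bytes C2 A0 in UTF-8, so the two strings differ.
$nbsp = "\u{00A0}";

$base = "//table[contains(@class,'infobox')]//tr";
$withSpace = $xpath->query($base . "[th='Known for']/td");
$withNbsp  = $xpath->query($base . "[th='Known{$nbsp}for']/td");

var_dump($withSpace->length); // number of rows matched with a plain space
var_dump($withNbsp->length);  // number of rows matched with U+00A0
echo $withNbsp->item(0)->textContent, "\n";
```

XPath compares the decoded characters, not the markup, which would explain why neither a plain space nor an entity written inside the expression matches.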
Secondly, I need to convert the XPath result set, which contains spaces, into a PHP variable that stores <0xa0>. I have:
$tmp = $xpath->query("//table[@class[contains(.,'infobox')]]//tr[th='Known<0xa0>for']/th/text()");
$tmp = $domDomoc->saveHTML($tmp[0]);
$result = $xpath->query("//table[@class[contains(.,'infobox')]]//tr[th='{$tmp}']/td/text()");
It seems the variable $tmp doesn't hold on to the <0xa0>, and in turn $result is incorrect (false).
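One possible explanation, offered as a sketch rather than a confirmed diagnosis: saveHTML() serializes the node back to HTML markup, so the no-break space may come out as an entity instead of the raw character, whereas nodeValue returns the decoded text. The sample markup and the variable names below are assumptions for illustration.

```php
<?php
// Assumed stand-in for the infobox row.
$html = '<table class="infobox"><tr><th>Known&#160;for</th>'
      . '<td>2008 Mumbai attacks</td></tr></table>';

$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);

$base = "//table[contains(@class,'infobox')]//tr";
$th = $xpath->query($base . "/th/text()")->item(0);

$serialized = $doc->saveHTML($th); // HTML markup for the node
$decoded    = $th->nodeValue;      // decoded text as a UTF-8 string

var_dump($serialized === $decoded); // shows whether the two forms differ
echo bin2hex($decoded), "\n";       // hex dump makes the C2 A0 bytes visible

// Reuse the decoded text in the second query.
// Note: a label containing a single quote would break this expression.
$result = $xpath->query($base . "[th='{$decoded}']/td/text()");
echo $result->item(0)->nodeValue, "\n";
```

Inspecting the bytes with bin2hex() is a quick way to check which kind of space a variable really holds.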
The whole PHP code is more complex and there are a lot of words to search for, so I have boiled the code down to a simpler task. Words like Known for are dynamic and are fed into a function.
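Since the labels are dynamic, one option might be to normalize the document side inside the expression instead of the label. The sketch below assumes PHP 7.1 or later; infoboxValue() is a hypothetical helper name and the markup is a hand-written stand-in. It uses the XPath 1.0 translate() function to map U+00A0 to a normal space before comparing, so callers can pass labels typed with ordinary spaces.

```php
<?php
// Hypothetical helper: returns the infobox value for a label typed with normal spaces.
function infoboxValue(DOMXPath $xpath, string $label): ?string
{
    $nbsp = "\u{00A0}";
    // translate() replaces U+00A0 with U+0020 in the header text before comparing.
    // Note: a label containing a single quote would break this expression.
    $query = "//table[contains(@class,'infobox')]"
           . "//tr[translate(th, '{$nbsp}', ' ')='{$label}']/td";
    $nodes = $xpath->query($query);
    if ($nodes === false || $nodes->length === 0) {
        return null; // no matching row
    }
    return trim($nodes->item(0)->textContent);
}

// Assumed stand-in for the infobox row.
$html = '<table class="infobox"><tr><th>Known&#160;for</th>'
      . '<td>2008 Mumbai attacks</td></tr></table>';

$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);

echo infoboxValue($xpath, 'Known for'), "\n";
```

The XPath normalize-space() function would not help here, because it only collapses space, tab, carriage return and line feed, not U+00A0.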