I'm not sure if I am stating the obvious...
What XPath selectors are really crawling through is the structured text itself, which means they need fairly clean XML (or HTML) to work properly.
A lot of XML is not very human-readable by nature; hence the heavy indentation and the syntax coloring that help our eyes follow the nested levels of tags:
<div></div>
An XPath query does not pay attention to what is between the tags, but rather to the tags themselves (element name, attributes, and so on). So if you crawl clean HTML or XML, it doesn't matter how deep or how far away the target is: the query will land you on the tag set you are aiming for (and then you will likely want to handle the contents yourself).
Well-formed XML is required to have exactly one root element, so the shortest document you should see is something like...
<html>
    <div>
        1
    </div>
    <div>
        2
    </div>
    <div>
        <h1>Hello</h1>
    </div>
</html>
So
for sel in response.xpath('//div'):
should iterate over all three <div> elements, and
for sel in response.xpath('//div//h1'):
would STEP INTO only the very last <div> and STEP ON the <h1> tag, where you could then read its contents if you wanted.
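You can verify this outside a spider with a minimal sketch using parsel (the selector library that Scrapy's response.xpath() is built on); the html string and variable names here are just for illustration:
from parsel import Selector

html = '''
<html>
    <div>1</div>
    <div>2</div>
    <div><h1>Hello</h1></div>
</html>
'''

sel = Selector(text=html)

# Matches all three <div> elements, however deep they sit.
for div in sel.xpath('//div'):
    print(div.get())

# Matches only the <h1> inside the last <div>.
for h1 in sel.xpath('//div//h1'):
    print(h1.xpath('text()').get())  # -> Hello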
Second, HTML and XML don't actually give much credence to whitespace (even though your example looked pretty, that formatting was for your benefit, not the benefit of your code). The whitespace-only text between tags is not part of the element structure, so an XPath query that selects elements skips it by default.
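If the whitespace inside the extracted text does matter to you, XPath's built-in normalize-space() function trims leading and trailing whitespace and collapses internal runs into single spaces. A small sketch, again with parsel:
from parsel import Selector

sel = Selector(text='<div>   1\n\n   </div>')

# Raw text node, whitespace and all:
print(sel.xpath('//div/text()').get())

# Same node with whitespace normalized away -> '1'
print(sel.xpath('normalize-space(//div)').get())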
Edit:
As for encoded entities, such as &nbsp;, most HTML packages have an entity-decoding function (in Python, html.unescape from the standard library), as those symbols can cause pain in other areas. You would want to decode the entities into their normal characters, which are often whitespace, left-bracket, right-bracket, and so on...
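For example, Python's standard library has html.unescape() (available since 3.4), which covers the named and numeric entities; the sample string here is made up:
import html

raw = 'Hello&nbsp;world &lt;div&gt; &amp; more'

# Decodes to 'Hello\xa0world <div> & more' --
# the &nbsp; becomes a literal non-breaking space character.
print(html.unescape(raw))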