scrapy response.xpath returns empty array on xml document with default namespace, while response.re works

Question

I am new to Scrapy and was experimenting with the Scrapy shell, trying to crawl this site: www.spiegel.de/sitemap.xml

I did it with

scrapy shell "http://www.spiegel.de/sitemap.xml"

and that works fine. When I use

response.body

I can see the whole page, including the XML tags.

However, something like this, for instance:

response.xpath('//loc')

simply won't work.

The result I get is an empty array,

while

response.selector.re('somevalidregexpexpression')

does work.

Any idea what the reason could be? Could it be related to encoding? The site is not UTF-8.

I am using Python 2.7 on Windows 7. I tried xpath() on another site (dmoz) and it worked fine.

score 31 · Accepted Answer · answered Mar 26 '16 at 00:44

The problem is caused by the default namespace declared on the root element of the XML:

xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"

So in that XML, the root element and all of its unprefixed descendants implicitly belong to that same namespace.

XPath 1.0, on the other hand, has no notion of a default namespace: to reference an element in a namespace, you need to use a prefix that is bound to that namespace URI. An unprefixed name such as loc only matches elements that are in no namespace, which is why //loc finds nothing here.
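The same behaviour can be reproduced outside Scrapy. The following is a minimal sketch using Python's standard-library ElementTree rather than Scrapy's selector, written for Python 3; the sample document and the example.com URLs are made up for illustration, and the prefix d is an arbitrary choice.

```python
import xml.etree.ElementTree as ET

# Namespace URI taken from the sitemap's root element.
SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"

# Hypothetical, shortened sitemap; the URLs are placeholders.
xml_text = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>http://example.com/a</loc></url>
  <url><loc>http://example.com/b</loc></url>
</urlset>"""

root = ET.fromstring(xml_text)

# The parser stores the inherited namespace as part of every tag name.
root_tag = root.tag

# An unprefixed path looks for "loc" in no namespace and finds nothing.
unprefixed = root.findall(".//loc")

# Binding a prefix to the namespace URI makes the same path match.
prefixed = root.findall(".//d:loc", {"d": SITEMAP_NS})
prefixed_urls = [element.text for element in prefixed]

print(root_tag)
print(unprefixed)
print(prefixed_urls)
```

The prefix in the query does not have to match anything in the document; only the URI it is bound to matters.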

You can use selector.register_namespace() to bind a prefix of your choice to the default namespace URI, and then use that prefix in your XPath:

response.selector.register_namespace('d', 'http://www.sitemaps.org/schemas/sitemap/0.9')
response.xpath('//d:loc')

@har07 you absolute life saver! – Daniel Dewhurst, Aug 09 '18 at 10:07

Rabih Kodeih · Answer 2 · 2018-10-23T09:11:24.233

score 4

You can also write an XPath that matches on the element's local name and ignores its namespace, as in:

response.xpath("//*[local-name()='loc']")

This is especially useful if you are parsing responses from multiple heterogeneous sources and you don't want to register each and every namespace.

edited Oct 23 '18 at 09:11

answered Oct 23 '18 at 08:51

