Using lxml to parse namespaced HTML?

Question

This is driving me totally nuts, I've been struggling with it for many hours. Any help would be much appreciated.

I'm using PyQuery 1.2.9 (which is built on top of lxml) to scrape this URL. I just want to get a list of all the links in the .linkoutlist section.

This is my request in full:

import requests
from pyquery import PyQuery as pq

response = requests.get('http://www.ncbi.nlm.nih.gov/pubmed/?term=The%20cost-effectiveness%20of%20mirtazapine%20versus%20paroxetine%20in%20treating%20people%20with%20depression%20in%20primary%20care')
doc = pq(response.content)
links = doc('#maincontent .linkoutlist a')
print(links)

But that returns an empty list. If I use this query instead:

links = doc('#maincontent .linkoutlist')

Then I get back this HTML:

<div xmlns="http://www.w3.org/1999/xhtml" xmlns:xi="http://www.w3.org/2001/XInclude" class="linkoutlist">
   <h4>Full Text Sources</h4>
   <ul>
      <li><a title="Full text at publisher's site" href="http://meta.wkhealth.com/pt/pt-core/template-journal/lwwgateway/media/landingpage.htm?issn=0268-1315&amp;volume=19&amp;issue=3&amp;spage=125" ref="itool=Abstract&amp;PrId=3159&amp;uid=15107654&amp;db=pubmed&amp;log$=linkoutlink&amp;nlmid=8609061" target="_blank">Lippincott Williams &amp; Wilkins</a></li>
      <li><a href="http://ovidsp.ovid.com/ovidweb.cgi?T=JS&amp;PAGE=linkout&amp;SEARCH=15107654.ui" ref="itool=Abstract&amp;PrId=3682&amp;uid=15107654&amp;db=pubmed&amp;log$=linkoutlink&amp;nlmid=8609061" target="_blank">Ovid Technologies, Inc.</a></li>
   </ul>
   <h4>Other Literature Sources</h4>
   ...
</div>

So the parent selectors do return HTML with lots of <a> tags. This also appears to be valid HTML.

More experimenting reveals that lxml does not like the xmlns attribute on the opening div, for some reason.

How can I ignore this in lxml, and just parse it like regular HTML?

UPDATE: Trying ns_clean, still failing:

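    from io import BytesIO
    from lxml import etree
    from lxml.cssselect import CSSSelector
    parser = etree.XMLParser(ns_clean=True)
    tree = etree.parse(BytesIO(response.content), parser)  # response.content is bytes
    sel = CSSSelector('#maincontent .rprt_all a')
    print(sel(tree))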

alecxe · Accepted Answer · 2015-04-13T15:30:13.693

You need to handle namespaces, including the default (unprefixed) one.

Working solution:

from pyquery import PyQuery as pq
import requests


response = requests.get('http://www.ncbi.nlm.nih.gov/pubmed/?term=The%20cost-effectiveness%20of%20mirtazapine%20versus%20paroxetine%20in%20treating%20people%20with%20depression%20in%20primary%20care')

namespaces = {'xi': 'http://www.w3.org/2001/XInclude', 'test': 'http://www.w3.org/1999/xhtml'}
links = pq('#maincontent .linkoutlist test|a', response.content, namespaces=namespaces)
for link in links:
    print(link.attrib.get("title", "No title"))

Prints titles of all links matching the selector:

Full text at publisher's site
No title
Free resource
Free resource
Free resource
Free resource
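
To see why the unprefixed a selector matches nothing while test|a works, here is a minimal sketch. It uses the standard library's xml.etree.ElementTree as a stand-in for lxml's XML parser (an assumption made so the example runs offline; lxml.etree exposes largely the same ElementTree API). The markup is a hypothetical, trimmed-down version of the fragment in the question, and the example.com URLs are placeholders.

```python
import xml.etree.ElementTree as ET

XHTML_NS = "http://www.w3.org/1999/xhtml"

# Hypothetical stand-in for the .linkoutlist fragment from the question.
SNIPPET = (
    '<div xmlns="http://www.w3.org/1999/xhtml" class="linkoutlist">'
    '<ul>'
    '<li><a title="Full text at publisher\'s site" href="http://example.com/a">A</a></li>'
    '<li><a href="http://example.com/b">B</a></li>'
    '</ul>'
    '</div>'
)

root = ET.fromstring(SNIPPET)

# The default xmlns puts every element into the XHTML namespace, so the
# real tag name is "{http://www.w3.org/1999/xhtml}a", not "a".
plain = root.findall(".//a")

# Mapping an arbitrary prefix ("test", as in the answer) to that URI lets
# the query name the element correctly.
namespaced = root.findall(".//test:a", {"test": XHTML_NS})

titles = [a.get("title", "No title") for a in namespaced]
print(root.tag)
print(len(plain), len(namespaced))
print(titles)
```

The prefix name is arbitrary; only the URI it maps to has to match the xmlns value in the document.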

Or, just set the parser to "html" and forget about namespaces:

links = pq('#maincontent .linkoutlist a', response.content, parser="html")
for link in links:
    print(link.attrib.get("title", "No title"))
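
Why an HTML parser sidesteps the problem can be sketched with the standard library's html.parser. This is a stand-in chosen so the example runs without lxml or pyquery, and it is not what pyquery does internally. The point it illustrates is that HTML parsing treats xmlns as an ordinary attribute, so tags keep their plain names. LinkCollector, the markup, and the example.com URLs are hypothetical.

```python
from html.parser import HTMLParser

# Hypothetical stand-in for the .linkoutlist fragment from the question.
SNIPPET = (
    '<div xmlns="http://www.w3.org/1999/xhtml" class="linkoutlist">'
    '<ul>'
    '<li><a title="Full text" href="http://example.com/a">A</a></li>'
    '<li><a href="http://example.com/b">B</a></li>'
    '</ul>'
    '</div>'
)


class LinkCollector(HTMLParser):
    """Record the attributes of the first <div> and the href of every <a>."""

    def __init__(self):
        super().__init__()
        self.div_attrs = None
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "div" and self.div_attrs is None:
            # xmlns arrives here as a plain attribute, not as a namespace.
            self.div_attrs = attrs
        elif tag == "a" and "href" in attrs:
            # The tag name is simply "a"; no namespace prefix is needed.
            self.hrefs.append(attrs["href"])


collector = LinkCollector()
collector.feed(SNIPPET)
collector.close()
print(collector.div_attrs)
print(collector.hrefs)
```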

Thanks so much. Out of interest, can you tell me why I was seeing this namespace attached to the `div` element? It's not there in the source of the page. — Richard, Apr 13 '15 at 08:41
@Richard Great question. It made me realize that the namespaces were inserted by pyquery, because it tried to parse the content with the XML parser when it should have used the HTML parser. Please see the update. Hope that helps. — alecxe, Apr 13 '15 at 15:31

Dave Lasley · Answer 2 · 2015-04-13T00:36:35.767

Good luck getting a standard XML/DOM parser to work on most HTML. Your best bet would be to use BeautifulSoup (pip install beautifulsoup4 or easy_install beautifulsoup4), which has a lot of handling for malformed markup. Maybe something like this instead?

import requests
from bs4 import BeautifulSoup

response = requests.get('http://www.ncbi.nlm.nih.gov/pubmed/?term=The%20cost-effectiveness%20of%20mirtazapine%20versus%20paroxetine%20in%20treating%20people%20with%20depression%20in%20primary%20care')
bs = BeautifulSoup(response.content, 'html.parser')  # name the parser explicitly
div = bs.find('div', class_='linkoutlist')
links = [ a['href'] for a in div.find_all('a') ]

>>> links
['http://meta.wkhealth.com/pt/pt-core/template-journal/lwwgateway/media/landingpage.htm?issn=0268-1315&volume=19&issue=3&spage=125', 'http://ovidsp.ovid.com/ovidweb.cgi?T=JS&PAGE=linkout&SEARCH=15107654.ui', 'https://www.researchgate.net/publication/e/pm/15107654?ln_t=p&ln_o=linkout', 'http://www.diseaseinfosearch.org/result/2199', 'http://www.nlm.nih.gov/medlineplus/antidepressants.html', 'http://toxnet.nlm.nih.gov/cgi-bin/sis/search/r?dbs+hsdb:@term+@rn+24219-97-4']

I know it's not the library you were looking to use, but I have historically slammed my head into walls on many occasions when it comes to DOM. The creators of BeautifulSoup have circumvented many edge cases that tend to happen in the wild.

Jahaja · Answer 3 · score 0 · Apr 13 '15 at 00:01

If I remember correctly from a similar problem I had a while ago, you can "ignore" the namespace by mapping it to None like this:

sel = CSSSelector('#maincontent .rprt_all a', namespaces={None: "http://www.w3.org/1999/xhtml"})
