How can I remove bad data in an XPath element using Python?

Question

I have this short example to demonstrate my problem:

from lxml import html

post = """<p>This a page with URLs.
<a href="http://google.com">This goes to&#xA; Google</a><br/>
<a href="http://yahoo.com">This &#xA; goes to Yahoo!</a><br/>
<a&#xA;href="http://example.com">This is invalid due to that&#xA;line feed character</p>&#xA;"""

doc = html.fromstring(post)
for link in doc.xpath('//a'):
    print(link.get('href'))

This outputs:

http://google.com
http://yahoo.com
None

The problem is that my data has &#xA; characters embedded in it. For my last link, one is embedded directly between the anchor tag name and the href attribute. The line feeds outside of the elements are important to me.

doc.xpath('//a') correctly saw the <a&#xA;href="http://example.com"> as a link, but it can't access the href attribute when I do link.get('href').

How can I clean the data if link.get('href') returns None, so that I can still retrieve the discovered href attribute?

I can't strip all of the &#xA; characters from the entire post element, as the ones in the text are important.

See, there is no ending tag for the last "a" element, so without it you will not get the required output — crax, Jul 06 '15 at 13:01
Since the third `<a>` is considered corrupted during parsing, I believe **lxml** will tolerate it but discard what we consider the "href" attribute, and even if you do an `html.tostring(...)` you cannot recover that info. Similarly, if you try parsing with `lxml.etree.fromstring` you will get an error. I could be wrong though... To make it work, I think you might need to cleanse your source before parsing — Anzel, Jul 06 '15 at 13:04
@C.R.Sharat, I tested that theory by removing the ending `</a>` from the Yahoo! link. It still returns the value for Yahoo! as expected. — NewGuy, Jul 06 '15 at 13:09
@Anzel, So that I understand, you are saying that I need to either parse twice (if an error occurs like this), or pre-parse - stripping these characters if and only if they are in an HTML element - then parse as I am now? — NewGuy, Jul 06 '15 at 13:10
@NewGuy, I'd say you had better strip those line feed characters before parsing with **lxml**, as in my opinion the source is corrupted. It's just that **lxml** can tolerate and discard the bad attributes somehow to make it "look" OK. — Anzel, Jul 06 '15 at 13:13

score 1 · Accepted Answer · edited May 23 '17 at 12:14

Module unidecode

Since you need the data outside of the tags, you could try using unidecode. It doesn't tackle Chinese and Korean, but it'll do things like change left and right quotes to ASCII quotes. It should help with these &#xA; characters as well, changing them to spaces instead of non-breaking spaces. Hopefully that's all you need in regards to preserving the other data. post.replace(u"&#xA;", u" ") is less heavy-handed if an ASCII space is okay.

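import urllib2
import unidecode
from lxml import html

# urlopen() returns a response object, so read and decode it to get unicode text.
# UTF-8 is an assumption about the page's encoding.
html_text = urllib2.urlopen("http://www.yourwebsite.com").read().decode("utf-8")
ascii_text = unidecode.unidecode(html_text)
doc = html.fromstring(ascii_text)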

Explanation of issue

There seems to be a known issue with this in several versions of Python, and it appears in C# as well. A related issue seems to have been closed because XML attributes aren't built to support carriage returns, so escaping them in all XML contexts would be silly. As it turns out, the W3C spec requires that the Unicode character be put in when parsing (see sec. 1):

All line breaks must have been normalized on input to #xA as described in 2.11 End-of-Line Handling, so the rest of this algorithm operates on text normalized in this way.
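
For reference, &#xA; is a hexadecimal character reference for code point 10, the line feed. That is a different character from the non-breaking space (U+00A0). A quick check using only built-ins (the variable names are just for illustration):

```python
# "&#xA;" is a hexadecimal character reference: the digits after "#x"
# are the code point of the character it stands for.
line_feed_code_point = int("A", 16)
line_feed = chr(line_feed_code_point)

# The non-breaking space is a different character, U+00A0.
nbsp_code_point = ord(u"\u00a0")

print(line_feed_code_point)   # 10
print(line_feed == "\n")      # True
print(nbsp_code_point)        # 160
```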

score 1 · Answer 2 · answered Jul 06 '15 at 15:36

You may solve your specific problem with:

post = post.replace('&#xA;', '\n')

Resulting test program:

from lxml import html

post = """<p>This a page with URLs. 
<a href="http://google.com">This goes to&#xA; Google</a><br/>
<a href="http://yahoo.com">This &#xA; goes to Yahoo!</a><br/>
<a&#xA;href="http://example.com">This is invalid due to that&#xA;line feed character</p>&#xA;"""

post = post.replace('&#xA;', '\n')

doc = html.fromstring(post)
for link in doc.xpath('//a'):
    print(link.get('href'))

Output:

http://google.com
http://yahoo.com
http://example.com
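
If the &#xA; references in the text have to stay exactly as they are, a narrower variant is to replace them only inside tags, as suggested in the comments on the question. The following is a minimal sketch of that idea using a regular expression. The function name fix_line_feeds_in_tags is hypothetical, and the pattern assumes that attribute values contain no literal '<' or '>' and that a space is an acceptable substitute inside a tag (including inside attribute values). It is not a general HTML tokenizer.

```python
import re

# Numeric character references for a line feed: &#xA;, &#xa;, &#x0A;, &#10;
LINE_FEED_REF = re.compile(r'&#(?:[xX]0*[aA]|0*10);')

# A rough match for a single tag: '<', anything except '<' or '>', then '>'.
# Assumption: attribute values contain no literal '<' or '>'.
TAG = re.compile(r'<[^<>]*>')


def fix_line_feeds_in_tags(markup):
    """Replace line feed references with a space, but only inside tags."""
    def clean_tag(match):
        return LINE_FEED_REF.sub(' ', match.group(0))
    return TAG.sub(clean_tag, markup)


post = """<p>This a page with URLs.
<a href="http://google.com">This goes to&#xA; Google</a><br/>
<a href="http://yahoo.com">This &#xA; goes to Yahoo!</a><br/>
<a&#xA;href="http://example.com">This is invalid due to that&#xA;line feed character</p>&#xA;"""

cleaned = fix_line_feeds_in_tags(post)
print(cleaned)
```

The cleaned string can then be passed to html.fromstring as before. The third tag becomes <a href="http://example.com">, while the references in the link text and after the closing </p> are left untouched.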
