I have this short example to demonstrate my problem:
from lxml import html
post = """<p>This a page with URLs.
<a href="http://google.com">This goes to
 Google</a><br/>
<a href="http://yahoo.com">This 
 goes to Yahoo!</a><br/>
<a
href="http://example.com">This is invalid due to that
line feed character</p>
"""
doc = html.fromstring(post)
for link in doc.xpath('//a'):
    print(link.get('href'))
This outputs:
http://google.com
http://yahoo.com
None
The problem is that my data has line feed characters embedded in it. For my last link, one is embedded directly between the anchor and the href attribute. The line feeds outside of the elements are important to me.
doc.xpath('//a') correctly saw the <a href="http://example.com"> element (the one with the line feed between a and href) as a link, but I can't access the href attribute when I call link.get('href').
How can I clean the data if link.get('href') returns None, so that I can still retrieve the discovered href attribute?
I can't strip all of the line feed characters from the entire post, as the ones in the text are important.
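To make the kind of cleanup I am after concrete, here is a minimal sketch that replaces line breaks with a space only inside tags and leaves the text between tags alone. It rests on assumptions that may not match every input: that the stray characters show up either as numeric character references (such as &#10; or &#13;) or as raw CR/LF characters, and that no attribute value contains a literal > character. The name clean_tags is hypothetical, and a line break inside an attribute value would also be turned into a space.

```python
import re

# Assumption: line breaks appear as numeric character references
# (&#10;, &#13;, &#xA;, &#xD;) or as raw CR/LF characters.
_LINE_BREAK = re.compile(r'&#(?:0*1[03]|[xX]0*[aAdD]);|[\r\n]')

# Anything that looks like a tag: '<', then a letter or '/', up to the next '>'.
# Assumption: attribute values never contain a literal '>'.
_TAG = re.compile(r'<[/A-Za-z][^<>]*>')


def clean_tags(markup):
    """Replace line breaks with a space inside tags; text nodes are untouched."""
    return _TAG.sub(lambda m: _LINE_BREAK.sub(' ', m.group(0)), markup)


sample = 'Line one&#10;line two <a&#10;href="http://example.com">link</a>'
cleaned = clean_tags(sample)
print(cleaned)
# Line one&#10;line two <a href="http://example.com">link</a>
```

The cleaned string could then be passed to html.fromstring(...) in place of the original post. I have only checked the string transformation itself, not how lxml handles every variant of the damaged markup.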