
Is it possible to access the data-* attributes of an HTML element from Python? I'm using Scrapy, and the data-* attributes are not available in a selector object, though the raw data is available in a Request object.

If I dump the html using wget -O page http://page.com then I can see the data in the file. It's something like <a href="blah" data-mine="a;slfkjasd;fklajsdfl;ahsdf">blahlink</a>

I can edit the data-mine portion in an editor, so I know it's there ... it just seems like well-behaved parsers are dropping it.

As you can see, I'm confused.

3 Answers


Yeah, for some reason lxml does not expose the attribute names, and Talvalin is right that html5lib does:

stav@maia:~$ python
Python 2.7.3 (default, Aug  1 2012, 05:14:39) [GCC 4.6.3] on linux2
>>> import html5lib
>>> html = '''<a href="blah" target="_blank" data-mine="a;slfkjasd;fklajsdfl;ahsdf"
... data-yours="truly">blahlink</a>'''
>>> for x in html5lib.parse(html, treebuilder='lxml').xpath('descendant::*/@*'):
...     print '%s = "%s"' % (x.attrname, x)
...
href = "blah"
target = "_blank"
data-mine = "a;slfkjasd;fklajsdfl;ahsdf"
data-yours = "truly"
Steven Almeroth

I did it like this without using a third-party library:

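import re

# Capture the value of a double-quoted data-email attribute.
data_email_pattern = re.compile(r'data-email="([^"]+)"')
# `response` is the Scrapy response passed to the spider callback.
# On Python 3, response.body is bytes; search response.text instead.
match = data_email_pattern.search(response.body)
if match:
    print(match.group(1))
    ...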
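If the regex feels too fragile, the same idea can be sketched with a real parser and still no third-party library, using the standard library's html.parser (Python 3). This is only a minimal sketch: DataAttrCollector and collect_data_attrs are names made up for illustration, and the sample string is the anchor tag from the question. In a Scrapy callback you would presumably feed it response.text instead.

```python
from html.parser import HTMLParser


class DataAttrCollector(HTMLParser):
    """Collect data-* attributes from every start tag (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.found = []  # list of (tag, attribute name, value) tuples

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; names arrive lowercased.
        for name, value in attrs:
            if name.startswith("data-"):
                self.found.append((tag, name, value))


def collect_data_attrs(html):
    """Return all data-* attributes found in an HTML string."""
    parser = DataAttrCollector()
    parser.feed(html)
    parser.close()
    return parser.found


# Sample markup taken from the question.
sample = '<a href="blah" data-mine="a;slfkjasd;fklajsdfl;ahsdf">blahlink</a>'
result = collect_data_attrs(sample)
print(result)
```

Unlike the regex, this does not care about attribute order or quoting style, though it is still a lenient parser and not a full HTML5 implementation like html5lib.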
    Beware the [Zalgo](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454). – Steven Almeroth Feb 16 '13 at 16:22

I've not tried it, but there is html5lib (http://code.google.com/p/html5lib/), which can be used in conjunction with Beautiful Soup instead of Scrapy's built-in selectors.

Talvalin
  • Having said that, if you could provide a link to the page you're trying to scrape then I'll happily test it out now. :) – Talvalin Feb 15 '13 at 11:42