
So I have this HTML:

<div class="price" itemprop="offers" itemscope itemtype="http://schema.org Offer">

And I'm trying to split it into a list, something like this:

[class="price", itemprop="offers", itemscope, itemtype="http://schema.org Offer"]

But I'm not sure how to handle the itemscope part.

My current regex looks like this: (\s.*?\"\s*.*?\s*\"). The problem is that when I split with it, itemscope and itemtype="http://schema.org Offer" end up as a single element, so my list looks like this:

[class="price", itemprop="offers", itemscope itemtype="http://schema.org Offer"]

Any idea how I can fix this?
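One way to make the split work (a sketch, assuming attribute values are double-quoted and never contain escaped quotes) is to match either a name="value" pair or a bare attribute name, so that itemscope becomes its own token:

```python
import re

tag_body = 'class="price" itemprop="offers" itemscope itemtype="http://schema.org Offer"'

# Match either name="value" (quoted value, no embedded quotes) or a bare
# attribute name such as itemscope. The quoted form must come first in
# the alternation so it is preferred when a value is present.
parts = re.findall(r'[\w-]+="[^"]*"|[\w-]+', tag_body)
print(parts)
# ['class="price"', 'itemprop="offers"', 'itemscope', 'itemtype="http://schema.org Offer"']
```

This only handles the simple quoting shown in the question; for anything messier, a real HTML parser (as the comments below suggest) is the safer tool.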

Vali

  • [I wouldn't recommend regex](https://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags). Use [BeautifulSoup](https://www.crummy.com/software/BeautifulSoup/bs4/doc/) instead. – TrebledJ Dec 02 '18 at 12:14
  • I'm already using BS for something else. What I'm trying to do here is convert an HTML tag like that one into an XPath in order to automate something, and to do that I need to split the tag. – Vali Dec 02 '18 at 12:15
  • You can get a list of attributes in BeautifulSoup; see this [answer](https://stackoverflow.com/questions/36597494/beautiful-soup-list-all-attributes). – Liinux Dec 02 '18 at 12:30
  • See this question for a discussion of why regex is not the best tool: https://stackoverflow.com/questions/6751105/why-its-not-possible-to-use-regex-to-parse-html-xml-a-formal-explanation-in-la – dmmfll Dec 02 '18 at 13:27

2 Answers

1

The lxml package offers some nice ways of dealing with XPaths and attributes on HTML elements.

Here is an example:

from io import StringIO
from lxml import etree

html = '<div class="price" itemprop="offers" itemscope itemtype="http://schema.org Offer"></div>'

# Parse the fragment with lxml's HTML parser; it wraps the div in
# implicit html/body elements.
tree = etree.parse(StringIO(html), etree.HTMLParser())
doc = tree.getroot()

# Absolute XPath for every element in the tree.
xpaths = [tree.getpath(element) for element in doc.iter()]

print(xpaths)

# Collect ('@name', value) pairs per element, keeping only elements
# that actually have attributes.
attributes_ = ([(f'@{att}', node.attrib[att]) for att in node.attrib]
               for node in doc.iter())
attributes = [item for item in attributes_ if item]
print(attributes)

OUTPUT:

['/html', '/html/body', '/html/body/div']

[[('@class', 'price'), ('@itemprop', 'offers'), ('@itemscope', ''), ('@itemtype', 'http://schema.org Offer')]]
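Since the goal stated in the comments is to turn the tag into an XPath, the path and the attributes can also be combined into a single expression with attribute predicates. This is a sketch building on the same lxml parse; note that itemscope parses to an empty string, so it gets a bare [@itemscope] predicate:

```python
from io import StringIO
from lxml import etree

html = '<div class="price" itemprop="offers" itemscope itemtype="http://schema.org Offer"></div>'
root = etree.parse(StringIO(html), etree.HTMLParser()).getroot()
div = root.find('.//div')

# Build one [@name="value"] predicate per attribute; attributes with an
# empty value (itemscope) are tested for mere presence.
predicates = ''.join(
    f'[@{name}="{value}"]' if value else f'[@{name}]'
    for name, value in div.attrib.items()
)
xpath = f'//div{predicates}'
print(xpath)

# Sanity check: the generated expression selects the original element.
assert root.xpath(xpath)[0] is div
```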

dmmfll
0

If you do not want to use Beautiful Soup, Python's standard library includes the html.parser module, which provides an HTML parser. Here is an example of how to use it.

(I changed the example HTML into a properly defined div.)

from html.parser import HTMLParser

class MyHTMLParser(HTMLParser):
    def __init__(self):
        super().__init__()
        # Collected attributes; per-instance state rather than a class
        # attribute shared by every instance.
        self.data = dict()

    def handle_starttag(self, tag, attrs):
        print("Encountered a start tag:", tag)
        # attrs is a list of (name, value) tuples; value is None for
        # valueless attributes such as itemscope.
        for name, value in attrs:
            print(f'{name}: {value}')
            self.data[name] = value

    def handle_endtag(self, tag):
        print("Encountered an end tag :", tag)

    def handle_data(self, data):
        print("Encountered some data  :", data)

parser = MyHTMLParser()
html = '<div class="price" itemprop="offers" itemscope itemtype="http://schema.org Offer"></div>'
parser.feed(html)
print(parser.data)

OUTPUT:

Encountered a start tag: div
class: price
itemprop: offers
itemscope: None
itemtype: http://schema.org Offer
Encountered an end tag : div
{'class': 'price', 'itemprop': 'offers', 'itemscope': None, 'itemtype': 'http://schema.org Offer'}
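The same handler can also produce exactly the list the question asked for, with valueless attributes kept as standalone items. A minimal sketch (the AttrListParser subclass name is just illustrative):

```python
from html.parser import HTMLParser

class AttrListParser(HTMLParser):
    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_starttag(self, tag, attrs):
        # Render each attribute back as name="value", or as the bare
        # name when the parser reports its value as None (itemscope).
        for name, value in attrs:
            self.parts.append(name if value is None else f'{name}="{value}"')

parser = AttrListParser()
parser.feed('<div class="price" itemprop="offers" itemscope itemtype="http://schema.org Offer"></div>')
print(parser.parts)
# ['class="price"', 'itemprop="offers"', 'itemscope', 'itemtype="http://schema.org Offer"']
```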

dmmfll