
So I have this HTML:

<div class="price" itemprop="offers" itemscope itemtype="http://schema.org Offer">

And I'm trying to split it into a list, something like this:

[class="price", itemprop="offers", itemscope, itemtype="http://schema.org Offer"]

But I'm not sure how to handle the itemscope part.

My current regex looks like this: (\s.*?\"\s*.*?\s*\"). The problem is that when I split with it, itemscope and itemtype="http://schema.org Offer" end up as a single element, so my list looks like this:

[class="price", itemprop="offers", itemscope itemtype="http://schema.org Offer"]

Any idea how I can fix this?
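One way to make the split work (a sketch, assuming attribute values are double-quoted and never contain escaped quotes) is to match either a name="value" pair or a bare attribute name, so that itemscope becomes its own token:

```python
import re

tag_body = 'class="price" itemprop="offers" itemscope itemtype="http://schema.org Offer"'

# Match either name="value" (quoted value, no embedded quotes) or a bare
# attribute name such as itemscope. The quoted form must come first in
# the alternation so it is preferred when a value is present.
parts = re.findall(r'[\w-]+="[^"]*"|[\w-]+', tag_body)
print(parts)
# ['class="price"', 'itemprop="offers"', 'itemscope', 'itemtype="http://schema.org Offer"']
```

This only handles the simple quoting shown in the question; for anything messier, a real HTML parser (as the comments below suggest) is the safer tool.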

Vali

  • [I wouldn't recommend regex](https://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags). Use [BeautifulSoup](https://www.crummy.com/software/BeautifulSoup/bs4/doc/) instead. – TrebledJ Dec 02 '18 at 12:14
  • I'm already using BS for something else. What I'm trying to do here is convert an HTML tag like that one into an XPath in order to automate something, and to do that I need to split the tag. – Vali Dec 02 '18 at 12:15
  • You can get a list of attributes in BeautifulSoup; see this [answer](https://stackoverflow.com/questions/36597494/beautiful-soup-list-all-attributes). – Liinux Dec 02 '18 at 12:30
  • See this question for a discussion of why regex is not the best tool: https://stackoverflow.com/questions/6751105/why-its-not-possible-to-use-regex-to-parse-html-xml-a-formal-explanation-in-la – dmmfll Dec 02 '18 at 13:27

2 Answers

1

The lxml package offers some nice ways of dealing with XPaths and attributes on HTML elements.

Here is an example:

from io import StringIO
from lxml import etree

html = '<div class="price" itemprop="offers" itemscope itemtype="http://schema.org Offer"></div>'

# Parse the fragment with lxml's HTML parser; it wraps the div in
# implicit html/body elements.
tree = etree.parse(StringIO(html), etree.HTMLParser())
doc = tree.getroot()

# Absolute XPath for every element in the tree.
xpaths = [tree.getpath(element) for element in doc.iter()]

print(xpaths)

# Collect ('@name', value) pairs per element, keeping only elements
# that actually have attributes.
attributes_ = ([(f'@{att}', node.attrib[att]) for att in node.attrib]
               for node in doc.iter())
attributes = [item for item in attributes_ if item]
print(attributes)

OUTPUT:

['/html', '/html/body', '/html/body/div']

[[('@class', 'price'), ('@itemprop', 'offers'), ('@itemscope', ''), ('@itemtype', 'http://schema.org Offer')]]
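Since the goal stated in the comments is to turn the tag into an XPath, the path and the attributes can also be combined into a single expression with attribute predicates. This is a sketch building on the same lxml parse; note that itemscope parses to an empty string, so it gets a bare [@itemscope] predicate:

```python
from io import StringIO
from lxml import etree

html = '<div class="price" itemprop="offers" itemscope itemtype="http://schema.org Offer"></div>'
root = etree.parse(StringIO(html), etree.HTMLParser()).getroot()
div = root.find('.//div')

# Build one [@name="value"] predicate per attribute; attributes with an
# empty value (itemscope) are tested for mere presence.
predicates = ''.join(
    f'[@{name}="{value}"]' if value else f'[@{name}]'
    for name, value in div.attrib.items()
)
xpath = f'//div{predicates}'
print(xpath)

# Sanity check: the generated expression selects the original element.
assert root.xpath(xpath)[0] is div
```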

dmmfll
0

If you do not want to use Beautiful Soup, Python's standard library includes the html.parser module, which provides an HTML parser. Here is an example of how to use it.

(I changed the example HTML into a properly defined div.)

from html.parser import HTMLParser

class MyHTMLParser(HTMLParser):
    def __init__(self):
        super().__init__()
        # Collected attributes; per-instance state rather than a class
        # attribute shared by every instance.
        self.data = dict()

    def handle_starttag(self, tag, attrs):
        print("Encountered a start tag:", tag)
        # attrs is a list of (name, value) tuples; value is None for
        # valueless attributes such as itemscope.
        for name, value in attrs:
            print(f'{name}: {value}')
            self.data[name] = value

    def handle_endtag(self, tag):
        print("Encountered an end tag :", tag)

    def handle_data(self, data):
        print("Encountered some data  :", data)

parser = MyHTMLParser()
html = '<div class="price" itemprop="offers" itemscope itemtype="http://schema.org Offer"></div>'
parser.feed(html)
print(parser.data)

OUTPUT:

Encountered a start tag: div
class: price
itemprop: offers
itemscope: None
itemtype: http://schema.org Offer
Encountered an end tag : div
{'class': 'price', 'itemprop': 'offers', 'itemscope': None, 'itemtype': 'http://schema.org Offer'}
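The same handler can also produce exactly the list the question asked for, with valueless attributes kept as standalone items. A minimal sketch (the AttrListParser subclass name is just illustrative):

```python
from html.parser import HTMLParser

class AttrListParser(HTMLParser):
    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_starttag(self, tag, attrs):
        # Render each attribute back as name="value", or as the bare
        # name when the parser reports its value as None (itemscope).
        for name, value in attrs:
            self.parts.append(name if value is None else f'{name}="{value}"')

parser = AttrListParser()
parser.feed('<div class="price" itemprop="offers" itemscope itemtype="http://schema.org Offer"></div>')
print(parser.parts)
# ['class="price"', 'itemprop="offers"', 'itemscope', 'itemtype="http://schema.org Offer"']
```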

dmmfll