Using regex to extract all the html attrs

Question

I want to use the re module to extract all the HTML nodes from a string, including all their attributes. However, I want each attribute to be its own group, so that I can use matchobj.group() to get them. The number of attributes in a node is flexible, and this is where I am confused: I don't know how to write such a regex. I have tried </?(\w+)(\s\w+[^>]*?)*/?> but for a node like <a href='aaa' style='bbb'> I only get two groups: [('a'), ("style='bbb'")].
I know there are some good HTML parsers. But actually I am not going to extract the values of the attrs. I need to modify the raw string.

Consider using HTML parsers instead of Regex. http://www.crummy.com/software/BeautifulSoup/ — Achrome, Jun 28 '13 at 01:53

score 2 · Answer 1 · edited May 23 '17 at 11:56

Please don't use regex. Use BeautifulSoup:

>>> from bs4 import BeautifulSoup as BS
>>> html = """<a href='aaa' style='bbb'>"""
>>> soup = BS(html, 'html.parser')
>>> mytag = soup.find('a')
>>> print(mytag['href'])
aaa
>>> print(mytag['style'])
bbb

Or if you want a dictionary:

>>> print(mytag.attrs)
{'href': 'aaa', 'style': 'bbb'}

answered Jun 28 '13 at 01:56

TerryA

I know HTML parsers should be good choices but actually I don't think they can work for me. I need to modify the raw string. – zhangyangyu Jun 28 '13 at 02:05
@zhangyangyu Take a look at [this](http://www.crummy.com/software/BeautifulSoup/bs4/doc/#replace-with) perhaps – TerryA Jun 28 '13 at 02:14

score 1 · Accepted Answer · answered Jun 28 '13 at 03:02

Description

Capturing an arbitrary number of attributes takes two steps: first match each entire element (the opening tag), then run a second regex over each matched element to collect its attributes.
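
One reason a single pattern falls short is that Python's re module keeps only the last capture of a repeated group. A minimal sketch of that behavior, using the pattern from the question (the variable names are only illustrative):

```python
import re

# The pattern from the question: the second group sits under a `*` quantifier.
pattern = re.compile(r"</?(\w+)(\s\w+[^>]*?)*/?>")
match = pattern.search("<a href='aaa' style='bbb'>")

# A repeated group is overwritten on every iteration, so only the final
# attribute survives; the earlier href capture is lost.
groups = match.groups()
print(groups)  # ('a', " style='bbb'")
```

This is why the attributes are collected with a second pass over each matched element rather than with more groups in one pattern.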

regex to grab all the elements: <\w+(?=\s|>)(?:[^>=]|='[^']*'|="[^"]*"|=[^'"][^\s>]*)*?>

regex to grab all the attributes from a single element: \s\w+=(?:'[^']*'|"[^"]*"|[^'"][^\s>]*)(?=\s|>)

Python Example

See working example: http://repl.it/J0t/4

Code

import re

string = """
<a href="i.like.kittens.com" NotRealAttribute=' true="4>2"' class=Fonzie>text</a>
"""

# Step 1: find each opening tag, skipping over any '>' inside quoted values.
for matchElementObj in re.finditer(r'<\w+(?=\s|>)(?:[^>=]|=\'[^\']*\'|="[^"]*"|=[^\'"][^\s>]*)*?>', string, re.M|re.I|re.S):
    print("-------")
    print("matchElementObj.group(0) : ", matchElementObj.group(0))

    # Step 2: search only within the matched tag, not the whole input string.
    for matchAttributesObj in re.finditer(r'\s\w+=(?:\'[^\']*\'|"[^"]*"|[^\'"][^\s>]*)(?=\s|>)', matchElementObj.group(0), re.M|re.I|re.S):
        print("matchAttributesObj.group(0) : ", matchAttributesObj.group(0))

Output

-------
matchElementObj.group(0) :  <a href="i.like.kittens.com" NotRealAttribute=' true="4>2"' class=Fonzie>
matchAttributesObj.group(0) :   href="i.like.kittens.com"
matchAttributesObj.group(0) :   NotRealAttribute=' true="4>2"'
matchAttributesObj.group(0) :   class=Fonzie
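
Since the question is about modifying the raw string, the same two patterns can probably be combined with re.sub. The following is only a sketch under some assumptions: the names rewrite_attribute, ELEMENT_RE and ATTRIBUTE_RE are hypothetical, the named groups name and value are added here and are not part of the original patterns, and new_value is assumed to contain no double quotes.

```python
import re

# Same element pattern as above, compiled once.
ELEMENT_RE = re.compile(
    r"""<\w+(?=\s|>)(?:[^>=]|='[^']*'|="[^"]*"|=[^'"][^\s>]*)*?>""")
# Same attribute pattern, with named groups added for name and value.
ATTRIBUTE_RE = re.compile(
    r"""\s(?P<name>\w+)=(?P<value>'[^']*'|"[^"]*"|[^'"][^\s>]*)(?=\s|>)""")


def rewrite_attribute(html, attr_name, new_value):
    """Replace the value of attr_name in every opening tag of html."""
    def fix_attribute(attr_match):
        if attr_match.group("name").lower() != attr_name.lower():
            return attr_match.group(0)  # leave other attributes untouched
        # Keep the original whitespace, name and '=', swap only the value.
        prefix_len = attr_match.start("value") - attr_match.start(0)
        return attr_match.group(0)[:prefix_len] + '"' + new_value + '"'

    def fix_element(element_match):
        # Second pass runs only inside the matched opening tag.
        return ATTRIBUTE_RE.sub(fix_attribute, element_match.group(0))

    return ELEMENT_RE.sub(fix_element, html)


html = """<a href="i.like.kittens.com" NotRealAttribute=' true="4>2"' class=Fonzie>text</a>"""
result = rewrite_attribute(html, "class", "Potsie")
print(result)
```

Everything outside the replaced value, including text between tags and the other attributes, is passed through unchanged, which is the main reason to stay with regex here instead of re-serializing through a parser.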
