I want to use the re module to extract all the HTML nodes from a string, including all their attributes. However, I want each attribute to be its own group, so that I can use matchobj.group() to get them. The number of attributes in a node is flexible, and this is where I am confused: I don't know how to write such a regex. I have tried </?(\w+)(\s\w+[^>]*?)*/?> but for a node like <a href='aaa' style='bbb'> I only get two groups: [('a', " style='bbb'")].
I know there are some good HTML parsers, but I am not going to extract the values of the attributes. I need to modify the raw string.

zhangyangyu
-
FFS... http://www.crummy.com/software/BeautifulSoup/ – Ignacio Vazquez-Abrams Jun 28 '13 at 01:53
-
Consider using HTML parsers instead of Regex. http://www.crummy.com/software/BeautifulSoup/ – Achrome Jun 28 '13 at 01:53
-
That is normal: a repeated capture group keeps only its last match, so the first match is overwritten by the second. – Casimir et Hippolyte Jun 28 '13 at 01:54
-
Why do you need to modify the raw string? – icedwater Jun 28 '13 at 03:03
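The comment above about the first match being overwritten can be checked directly. This is only a small sketch using the pattern and sample tag from the question; the variable names are made up for illustration.

```python
import re

# Pattern from the question: group 2 sits inside a repetition (...)*.
pattern = re.compile(r"</?(\w+)(\s\w+[^>]*?)*/?>")

match = pattern.search("<a href='aaa' style='bbb'>")

# Group 2 matched twice (once per attribute), but the match object
# only remembers what the LAST repetition captured.
groups = match.groups()
print(groups)
```

So the number of groups is fixed by the pattern itself, not by how many times a repeated group matched, which is why a single regex cannot hand back one group per attribute.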
2 Answers
Please don't use regex. Use BeautifulSoup:
>>> from bs4 import BeautifulSoup as BS
>>> html = """<a href='aaa' style='bbb'>"""
>>> soup = BS(html, 'html.parser')
>>> mytag = soup.find('a')
>>> print(mytag['href'])
aaa
>>> print(mytag['style'])
bbb
Or if you want a dictionary:
>>> print(mytag.attrs)
{'href': 'aaa', 'style': 'bbb'}
-
I know HTML parsers should be good choices but actually I don't think they can work for me. I need to modify the raw string. – zhangyangyu Jun 28 '13 at 02:05
-
@zhangyangyu Take a look at [this](http://www.crummy.com/software/BeautifulSoup/bs4/doc/#replace-with) perhaps – TerryA Jun 28 '13 at 02:14
Description
To capture an arbitrary number of attributes you need a two-step process: first match each entire element, then iterate over the matched elements and collect the attributes of each one.
regex to grab all the elements: <\w+(?=\s|>)(?:[^>=]|='[^']*'|="[^"]*"|=[^'"][^\s>]*)*?>
regex to grab all the attributes from a single element: \s\w+=(?:'[^']*'|"[^"]*"|[^'"][^\s>]*)(?=\s|>)
Python Example
See working example: http://repl.it/J0t/4
Code
import re

string = """
<a href="i.like.kittens.com" NotRealAttribute=' true="4>2"' class=Fonzie>text</a>
"""

# Step 1: match each opening tag as a whole.
for matchElementObj in re.finditer(r'<\w+(?=\s|>)(?:[^>=]|=\'[^\']*\'|="[^"]*"|=[^\'"][^\s>]*)*?>', string, re.M|re.I|re.S):
    print("-------")
    print("matchElementObj.group(0) :", matchElementObj.group(0))
    # Step 2: search only inside the matched tag, not the whole input.
    for matchAttributesObj in re.finditer(r'\s\w+=(?:\'[^\']*\'|"[^"]*"|[^\'"][^\s>]*)(?=\s|>)', matchElementObj.group(0), re.M|re.I|re.S):
        # Each match starts with the separating whitespace, so strip it for display.
        print("matchAttributesObj.group(0) :", matchAttributesObj.group(0).strip())
Output
-------
matchElementObj.group(0) : <a href="i.like.kittens.com" NotRealAttribute=' true="4>2"' class=Fonzie>
matchAttributesObj.group(0) : href="i.like.kittens.com"
matchAttributesObj.group(0) : NotRealAttribute=' true="4>2"'
matchAttributesObj.group(0) : class=Fonzie
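Since the question is about modifying the raw string, the same two-step idea can be combined with re.sub and a callback, so that everything outside the attribute values is left byte-for-byte as it was. The following is only a sketch under some assumptions: the capture groups added to the attribute pattern, and the names rewrite_attributes, edit and upper_href, are hypothetical and not part of the answer above.

```python
import re

# Element pattern as in the answer above; the attribute pattern is the same
# one with capture groups added (an assumption of this sketch) so that the
# whitespace, the name and the raw value can be handled separately.
ELEMENT_RE = re.compile(r'''<\w+(?=\s|>)(?:[^>=]|='[^']*'|="[^"]*"|=[^'"][^\s>]*)*?>''')
ATTR_RE = re.compile(r'''(\s)(\w+)=('[^']*'|"[^"]*"|[^'"][^\s>]*)(?=\s|>)''')


def rewrite_attributes(html, edit):
    """Return html with every attribute value replaced by edit(name, raw_value).

    raw_value still carries its original quotes, if it had any.
    """
    def fix_attr(m):
        space, name, raw_value = m.groups()
        return f"{space}{name}={edit(name, raw_value)}"

    def fix_element(m):
        # Step 2 runs only inside the text of the tag matched in step 1.
        return ATTR_RE.sub(fix_attr, m.group(0))

    return ELEMENT_RE.sub(fix_element, html)


def upper_href(name, raw_value):
    # Hypothetical edit: upper-case href values, leave everything else alone.
    return raw_value.upper() if name.lower() == "href" else raw_value


html = "<a href='aaa' style='bbb'>text</a>"
result = rewrite_attributes(html, upper_href)
print(result)
```

Because the callback receives the raw value including its quotes, the original quoting style is preserved unless the edit function chooses to change it. The usual caveats about regex and HTML still apply (comments, script blocks and malformed tags are not handled).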

Ro Yo Mi