0

I am trying to write my own scripts that take in HTML files, report errors, and clean them up (I am doing this to learn regex and because I find it useful).

I am starting with a quick function that takes the document and grabs all of the tags in the order they appear, so I can check that they are all closed. I use the following:

>>> import re
>>> s = """<a>link</a>
... <div id="something">
...     <p style="background-color:#f00">paragraph</p>
... </div>"""
>>> re.findall('(?m)<.*>',s)
['<a>link</a>', '<div id="something">', '<p style="background-color:#f00">paragraph</p>', '</div>']

I understand that it grabs everything between the first and the last angle bracket on each line, so each match ends up being the whole line. What would I use to return the following instead:

['<a>','</a>', '<div id="something">', '<p style="background-color:#f00">','</p>', '</div>']
Ryan Saxe

3 Answers

2
re.findall('(?m)<.*?>',s)

-- or --

re.findall('(?m)<[^>]*>',s)

The question mark after the * makes it a non-greedy (lazy) match, meaning that it takes only as many characters as it needs, as opposed to the default greedy behavior, where it takes as many as possible.

The second form is used more often. It uses a negated character class to match everything but >, since that character will never appear inside the tag except at the end.
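To make the difference concrete, here is a small sketch that runs the greedy pattern and both fixes side by side on the string from the question. The multiline string at the end is a hypothetical input, not from the question; it is there only to show one case where the two fixes differ, because . does not match a newline by default while a negated character class does.

```python
import re

s = """<a>link</a>
<div id="something">
    <p style="background-color:#f00">paragraph</p>
</div>"""

# Greedy: .* runs to the last > on each line.
greedy = re.findall(r'<.*>', s)

# Lazy: .*? stops at the first > it can reach.
lazy = re.findall(r'<.*?>', s)

# Negated class: [^>]* cannot step over a >, so it also stops at the first one.
negated = re.findall(r'<[^>]*>', s)

print(greedy)
print(lazy)
print(negated)

# Hypothetical input: a tag split across two lines.
multiline = '<div\n  id="x">'
lazy_multiline = re.findall(r'<.*?>', multiline)       # . will not cross the newline
negated_multiline = re.findall(r'<[^>]*>', multiline)  # [^>] will
print(lazy_multiline)
print(negated_multiline)
```

On the question's input the lazy and negated-class patterns give the same list. Neither handles a > inside a quoted attribute value, which is one reason regex is only a rough tool for HTML.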

Alex Gittemeier
1

Although you really shouldn't be parsing HTML with regex, I understand that this is a learning exercise.

You only need to add one more character:

>>> re.findall('(?m)<.*?>',s) # See the ? after .*
['<a>', '</a>', '<div id="something">', '<p style="background-color:#f00">', '</p>', '</div>']

*? matches 0 or more of the preceding element (in this case, .). This is a lazy match, and will match as few characters as possible.
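For comparison with the caveat about parsing HTML with regex, here is a minimal sketch of the same tag list built with the standard library's html.parser module. The names TagCollector and collect_tags are made up for this illustration; note also that the parser lowercases tag names in end tags, so the output is not always a character-for-character copy of the input.

```python
from html.parser import HTMLParser


class TagCollector(HTMLParser):
    """Hypothetical helper: records every start and end tag in document order."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        # get_starttag_text() returns the raw text of the start tag just parsed.
        self.tags.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        self.tags.append('</%s>' % tag)


def collect_tags(html):
    collector = TagCollector()
    collector.feed(html)
    collector.close()
    return collector.tags


s = """<a>link</a>
<div id="something">
    <p style="background-color:#f00">paragraph</p>
</div>"""

tags = collect_tags(s)
print(tags)
```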

TerryA
0
re.findall('(?m)<[^<>]+>',s)  # inside [...], a later ^ and . are literal, so [^<^>.] would also reject tags containing ^ or .
dilbert