file cleaner using regex

Question

I am trying to write my own scripts that take in HTML files, report errors, and clean them up. I'm doing this to learn regex and because I find it useful.

I am starting with a quick function that takes the document and grabs all of the tags in order, so I can check that they are all closed. I use the following:

>>> import re
>>> s = """<a>link</a>
... <div id="something">
...     <p style="background-color:#f00">paragraph</p>
... </div>"""
>>> re.findall('(?m)<.*>',s)
['<a>link</a>', '<div id="something">', '<p style="background-color:#f00">paragraph</p>', '</div>']

I understand that it grabs everything between the first and last angle brackets on a line, so the match becomes the whole line. What would I use to return the following:

['<a>','</a>', '<div id="something">', '<p style="background-color:#f00">','</p>', '</div>']

Alex Gittemeier · Answer 1 · 2013-07-03T06:16:33.317

2

re.findall('(?m)<.*?>',s)

-- or --

re.findall('(?m)<[^>]*>',s)

The question mark after the * makes the match non-greedy: it takes only as much as it needs, whereas by default it takes as much as possible.

The second form is used more often. It uses a negated character class to match everything except >, since > should not appear anywhere inside the tag other than at its end.
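To make the difference concrete, here is a small sketch that runs the greedy pattern and both suggested forms on the string from the question. The two-line tag at the end is a made-up example, added to show one place where the two forms diverge: . does not match a newline by default, while [^>] does. The (?m) flag only changes how ^ and $ behave, so it has no effect on these patterns and is left out.

```python
import re

s = """<a>link</a>
<div id="something">
    <p style="background-color:#f00">paragraph</p>
</div>"""

greedy = re.findall(r'<.*>', s)      # runs to the last '>' on each line
lazy = re.findall(r'<.*?>', s)       # stops at the first '>' it can reach
negated = re.findall(r'<[^>]*>', s)  # can never cross a '>'

# Hypothetical tag split across two lines (not from the question)
multiline = '<div\n  id="x">'
lazy_ml = re.findall(r'<.*?>', multiline)        # '.' stops at the newline
negated_ml = re.findall(r'<[^>]*>', multiline)   # '[^>]' also matches '\n'

print(greedy)
print(lazy)
print(negated)
print(lazy_ml, negated_ml)
```

On the question's input the lazy and negated forms return the same six tags; only the multi-line tag separates them.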

edited Jul 03 '13 at 06:16

answered Jul 03 '13 at 06:11

Alex Gittemeier


1

+1 agreed, this is a better answer because it actually explains the issue. – Trent Jul 03 '13 at 06:25

score 1 · Accepted Answer · edited May 23 '17 at 10:32

1

Although you really shouldn't be parsing HTML with regex, I understand that this is a learning exercise.

You only need to add one more character:

>>> re.findall('(?m)<.*?>',s) # See the ? after .*
['<a>', '</a>', '<div id="something">', '<p style="background-color:#f00">', '</p>', '</div>']

*? matches 0 or more of the preceding element (in this case, .). It is a lazy match, so it matches as few characters as possible.
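Since the stated goal is to check that every tag gets closed, the resulting list can be fed to a simple stack. This is only a rough sketch for the learning exercise: the function name unclosed_tags and the message strings are made up here, and it does not know about void elements such as <br>, which it would report as unclosed.

```python
import re

def unclosed_tags(html):
    """Return a list of problem descriptions; an empty list means balanced."""
    problems = []
    stack = []  # names of tags that are currently open
    for tag in re.findall(r'<[^>]*>', html):
        name_match = re.match(r'</?\s*([A-Za-z][A-Za-z0-9]*)', tag)
        if name_match is None:
            continue  # comments, doctypes and the like are skipped
        name = name_match.group(1).lower()
        if tag.startswith('</'):
            # A closing tag must match the most recently opened tag
            if stack and stack[-1] == name:
                stack.pop()
            else:
                problems.append('unexpected ' + tag)
        elif not tag.endswith('/>'):
            # Opening tag; self-closing tags like <x/> are not pushed
            stack.append(name)
    problems.extend('unclosed <%s>' % name for name in stack)
    return problems

print(unclosed_tags('<a>link</a><div><p>text</p>'))
```

Anything still on the stack at the end was opened but never closed, which is the kind of error the question wants to catch.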

edited May 23 '17 at 10:32

Community


answered Jul 03 '13 at 06:12

TerryA


Great, thanks! And yes, I have seen that post, but in order to learn, and check small files for bugs, I think it's appropriate and the best method ahah – Ryan Saxe Jul 03 '13 at 06:22
@RyanSaxe Yep! That's fine :) – TerryA Jul 03 '13 at 06:24

score 0 · Answer 3 · answered Jul 03 '13 at 06:13

0

re.findall('(?m)<[^<>]+>',s)
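One thing to watch with negated character classes in this kind of pattern: only a leading ^ negates. A later ^, or a ., is an ordinary literal inside the brackets, so listing them excludes any tag that contains those characters. A minimal sketch of the effect, using made-up sample tags:

```python
import re

# Hypothetical sample tags, two of which contain a literal '.'
samples = ['<a href="page.html">', '<img src="a.png">', '<p>', '</p>']

narrow = re.compile(r'<[^<^>.]+>')  # also excludes literal '^' and '.'
plain = re.compile(r'<[^<>]+>')     # excludes only '<' and '>'

narrow_hits = [t for t in samples if narrow.fullmatch(t)]
plain_hits = [t for t in samples if plain.fullmatch(t)]

print(narrow_hits)
print(plain_hits)
```

Using + instead of * also means an empty <> is never matched, which is usually what you want when collecting tags.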

answered Jul 03 '13 at 06:13

dilbert

