Regexp to find content of HTML form

Question

I'm having trouble finding the content of HTML forms (or any other tag, for that matter). I've tried

    forms = re.findall('<form.*/form>', htmltext)

but with no results. Where's the mistake?

You'd be far better off using an HTML parser; BeautifulSoup is excellent. — Martijn Pieters, Jun 03 '14 at 14:35
Thanks to both for the advice. I still don't understand why the regexp isn't working though. — AnotherUser, Jun 03 '14 at 14:39
Never ever ever ever parse html with regex http://blog.codinghorror.com/parsing-html-the-cthulhu-way/ — That1Guy, Jun 03 '14 at 14:39
Please read http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454 — timgeb, Jun 03 '14 at 14:41
Thanks, those were real eye-openers! But what if the line I posted above (corrected, of course) is the only parsing I need in a program? Is it still worth importing an external library or writing many more lines of code with, e.g., HTMLParser? — AnotherUser, Jun 03 '14 at 14:48

score 0 · Answer 1 · answered Jun 03 '14 at 14:41

That pattern only works if the whole form is on a single line, because by default '.' does not match a newline. You need to pass re.DOTALL as a flag:

    forms = re.findall('<form.*/form>', htmltext, re.DOTALL)

You could use re.IGNORECASE | re.DOTALL in case you need to catch something like <Form ...
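
As a rough illustration of how the flags change the result, here is a minimal sketch. The sample htmltext is made up for the example, and the non-greedy '.*?' variant is an addition that is not part of the pattern above; it is included because a greedy '.*' combined with re.DOTALL spans from the first opening form tag to the last closing one.

```python
import re

# Hypothetical sample input: two forms, each spanning several lines.
htmltext = """<html><body>
<form action="/a">
  <input name="x">
</form>
<p>between</p>
<Form action="/b">
  <input name="y">
</Form>
</body></html>"""

# Without re.DOTALL, '.' stops at newlines, so a multi-line form never matches.
no_flag = re.findall('<form.*/form>', htmltext)

# With re.DOTALL, the greedy '.*' runs up to the LAST '/form>' in the text,
# so both forms come back merged into a single match.
greedy = re.findall('<form.*/form>', htmltext, re.DOTALL | re.IGNORECASE)

# Assumption (not in the pattern above): a non-greedy '.*?' stops at the
# first '/form>' after each opening tag, giving one match per form.
lazy = re.findall('<form.*?/form>', htmltext, re.DOTALL | re.IGNORECASE)

print(len(no_flag), len(greedy), len(lazy))
```

Even the non-greedy version is only a sketch: it can still be fooled by text such as '<form' inside a comment or script, which is the case the comments recommending an HTML parser have in mind.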

MiquelFire