57

I need to extract the contents of forms from an HTML source file. After some searching I found a good way to do that, but it only prints the first match. How can I loop through the matches and output the contents of every form, not just the first one?

import re

line = 'bla bla bla<form>Form 1</form> some text...<form>Form 2</form> more text?'
matchObj = re.search('<form>(.*?)</form>', line, re.S)
print(matchObj.group(1))
# Output: Form 1
# I need it to output the content of every form it finds, not just the first one...
Stan
  • 5
    You really don't want to parse HTML with regular expressions. http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags – Wooble Oct 11 '11 at 11:06
  • Please refer to this: http://stackoverflow.com/questions/3873361/finding-multiple-occurrences-of-a-string-within-a-string-in-python – avasal Oct 11 '11 at 11:07

3 Answers

108

Do not use regular expressions to parse HTML.

But if you ever need to find all regexp matches in a string, use the findall function.

import re
line = 'bla bla bla<form>Form 1</form> some text...<form>Form 2</form> more text?'
matches = re.findall('<form>(.*?)</form>', line, re.DOTALL)
print(matches)

# Output: ['Form 1', 'Form 2']
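
One detail worth knowing about findall is that the shape of its result depends on how many capturing groups the pattern has. The sketch below is only an illustration: the `id` attributes and the variable names `contents` and `pairs` are made up for the example and are not part of the question.

```python
import re

# Hypothetical input: the id attributes are invented for this example.
line = 'bla<form id="a">Form 1</form> text <form id="b">Form 2</form>'

# One capturing group: findall returns a list of strings.
contents = re.findall(r'<form[^>]*>(.*?)</form>', line, re.DOTALL)

# Two capturing groups: findall returns a list of tuples, one tuple per match.
pairs = re.findall(r'<form id="(\w+)">(.*?)</form>', line, re.DOTALL)

print(contents)  # ['Form 1', 'Form 2']
print(pairs)     # [('a', 'Form 1'), ('b', 'Form 2')]
```

With no capturing group at all, findall returns the full text of each match instead.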
Stan James
Petr Viktorin
  • 1
    what does the re.S do? – Charlie Parker Feb 21 '14 at 03:09
  • 3
    Makes the `'.'` special character match any character at all, including a newline; without this flag, `'.'` will match anything *except* a newline. ( http://docs.python.org/2/library/re.html#re.S ) – Petr Viktorin Feb 21 '14 at 09:03
  • Oh, I see, I did go to the webpage but didn't understand the documentation because nothing was underneath re.S but now I see how to read the documentation, re.S and re.DOTALL are the same...thanks! – Charlie Parker Feb 21 '14 at 16:55
  • You're welcome! `re.DOTALL` is more clear, I've updated the answer. – Petr Viktorin Feb 22 '14 at 23:06
  • This is the best method. Just to confirm, as findall returns a normal array, access results with matches[0], matches[1], etc – moyo Nov 25 '21 at 12:00
40

Instead of re.search, use re.findall; it returns all matches in a list. Alternatively, use re.finditer (my preference), which returns an iterator of match objects that you can loop over.

import re

line = 'bla bla bla<form>Form 1</form> some text...<form>Form 2</form> more text?'
for match in re.finditer('<form>(.*?)</form>', line, re.S):
    print(match.group(1))
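
Because finditer yields match objects rather than plain strings, each match also carries its position in the input. A minimal sketch of that, where the list name `found` is just an illustrative choice:

```python
import re

line = 'bla bla bla<form>Form 1</form> some text...<form>Form 2</form> more text?'

# Collect the captured text together with where it sits in the input string.
found = []
for match in re.finditer(r'<form>(.*?)</form>', line, re.DOTALL):
    text = match.group(1)                     # content between the tags
    start, end = match.start(1), match.end(1) # offsets of group 1 in `line`
    found.append((text, start, end))

for text, start, end in found:
    # Slicing the original string with the offsets gives back the same text.
    print(text, start, end, line[start:end] == text)
```

If you only need the strings, findall is shorter; finditer pays off when you need offsets or several groups per match.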
Aamir Rind
  • 1
    what does the re.S do? – Charlie Parker Feb 21 '14 at 03:16
  • 1
    `re.finditer` is exactly what I needed! Thanks! – shellbye Apr 25 '16 at 07:06
  • 1
    @Pinocchio docs say: re.S is the same as re.DOTALL ``Make the '.' special character match any character at all, including a newline; without this flag, '.' will match anything except a newline.`` (posted this because I believe people like me often come to stackoverflow.com to find answers quickly) – Anton Jun 08 '17 at 11:25
  • 1
    I'm not sure why this isn't the accepted answer - `finditer` is correct. `re.search()` returns a match object. `re.finditer()` returns an iterable of all the found match objects (wrap in `list()` to just iterate the whole thing). `findall()` returns just a list of tuples. Naming `finditer` like `findall` was a poor choice - it matches more with `search` and should have been called `searchiter`. – mattmc3 Aug 08 '22 at 15:29
  • Yep, agree with matt. This is the best answer. The `finditer()` [match objects](https://docs.python.org/3/library/re.html#match-objects) are more powerful than the list `findall()` returns because it lets you handle each match group explicitly and get other metadata about the match(es). – Andrew Jun 22 '23 at 17:27
6

Using regexes for this purpose is the wrong approach. Since you are using Python, you have an excellent library available for extracting parts of HTML documents: BeautifulSoup.
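
To give a rough idea of what parser-based extraction looks like, here is a sketch that deliberately swaps BeautifulSoup for the standard library's `html.parser`, so it runs without installing anything. It is not BeautifulSoup's API; the class name `FormTextCollector` and its attributes are hypothetical, and it collects only the text inside each form, not nested markup.

```python
from html.parser import HTMLParser


class FormTextCollector(HTMLParser):
    """Collects the text found inside each top-level <form> element."""

    def __init__(self):
        super().__init__()
        self.forms = []    # one string per completed form
        self._depth = 0    # how many <form> tags are currently open
        self._chunks = []  # text pieces of the form being read

    def handle_starttag(self, tag, attrs):
        if tag == 'form':
            self._depth += 1
            if self._depth == 1:
                self._chunks = []

    def handle_endtag(self, tag):
        if tag == 'form' and self._depth > 0:
            self._depth -= 1
            if self._depth == 0:
                self.forms.append(''.join(self._chunks))

    def handle_data(self, data):
        # Keep text only while we are inside a form.
        if self._depth > 0:
            self._chunks.append(data)


line = 'bla bla bla<form>Form 1</form> some text...<form>Form 2</form> more text?'
collector = FormTextCollector()
collector.feed(line)
collector.close()
print(collector.forms)  # ['Form 1', 'Form 2']
```

Unlike the regex, this still works when the form tags carry attributes or differ in case, since the parser normalizes tag names to lowercase.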

ThiefMaster