Python - Using regex to find multiple matches and print them out

Question

I need to extract the contents of the forms in an HTML source file. I searched around and found a good method for this, but it prints only the first match. How can I loop over the matches and output the contents of every form, not just the first one?

import re
line = 'bla bla bla<form>Form 1</form> some text...<form>Form 2</form> more text?'
matchObj = re.search('<form>(.*?)</form>', line, re.S)
print(matchObj.group(1))
# Output: Form 1
# I need it to output the content of every form it finds, not just the first one...

You really don't want to parse HTML with regular expressions. http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags — Wooble, Oct 11 '11 at 11:06
Please refer to this question: http://stackoverflow.com/questions/3873361/finding-multiple-occurrences-of-a-string-within-a-string-in-python — avasal, Oct 11 '11 at 11:07

score 108 · Accepted Answer · edited May 25 '19 at 23:26

Do not use regular expressions to parse HTML.

But if you ever need to find all regexp matches in a string, use the findall function.

import re
line = 'bla bla bla<form>Form 1</form> some text...<form>Form 2</form> more text?'
matches = re.findall('<form>(.*?)</form>', line, re.DOTALL)
print(matches)

# Output: ['Form 1', 'Form 2']
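What `re.findall` returns depends on how many capturing groups the pattern has, which is easy to trip over. The sketch below reuses the sample string from the question; the variable names and the two-group pattern are only illustrative.

```python
import re

line = 'bla bla bla<form>Form 1</form> some text...<form>Form 2</form> more text?'

# No capturing group: each item is the whole match.
whole = re.findall(r'<form>.*?</form>', line, re.DOTALL)

# One capturing group: each item is that group's text.
inner = re.findall(r'<form>(.*?)</form>', line, re.DOTALL)

# Two capturing groups: each item is a tuple with one entry per group.
pairs = re.findall(r'<(form)>(.*?)</form>', line, re.DOTALL)

print(whole)  # ['<form>Form 1</form>', '<form>Form 2</form>']
print(inner)  # ['Form 1', 'Form 2']
print(pairs)  # [('form', 'Form 1'), ('form', 'Form 2')]
```

Since the result is an ordinary list, individual matches are available as `inner[0]`, `inner[1]`, and so on.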

edited May 25 '19 at 23:26 by Stan James
answered Oct 11 '11 at 11:09 by Petr Viktorin
what does the re.S do? – Charlie Parker Feb 21 '14 at 03:09
Makes the `'.'` special character match any character at all, including a newline; without this flag, `'.'` will match anything *except* a newline. ( http://docs.python.org/2/library/re.html#re.S ) – Petr Viktorin Feb 21 '14 at 09:03
Oh, I see, I did go to the webpage but didn't understand the documentation because nothing was underneath re.S but now I see how to read the documentation, re.S and re.DOTALL are the same...thanks! – Charlie Parker Feb 21 '14 at 16:55
You're welcome! `re.DOTALL` is more clear, I've updated the answer. – Petr Viktorin Feb 22 '14 at 23:06
This is the best method. Just to confirm: since findall returns a normal list, access results with matches[0], matches[1], etc. – moyo Nov 25 '21 at 12:00
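To make the `re.S` / `re.DOTALL` discussion above concrete, here is a small sketch. The multi-line `text` value is made up for illustration; it puts a newline inside the form so the flag actually matters.

```python
import re

# Hypothetical input: the form content spans two lines.
text = '<form>line one\nline two</form>'

# Without the flag, '.' stops at the newline, so nothing matches.
without_flag = re.findall(r'<form>(.*?)</form>', text)

# With re.DOTALL, '.' also matches the newline.
with_flag = re.findall(r'<form>(.*?)</form>', text, re.DOTALL)

print(without_flag)          # []
print(with_flag)             # ['line one\nline two']
print(re.S == re.DOTALL)     # True
```

For the single-line sample string in the question the flag makes no difference, but real HTML usually spreads a form over several lines.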

score 40 · Answer 2 · answered Oct 11 '11 at 12:34

Instead of using re.search, use re.findall; it returns all matches in a list. Alternatively, use re.finditer (my preferred option), which returns an iterator that you can loop over to visit every match.

import re
line = 'bla bla bla<form>Form 1</form> some text...<form>Form 2</form> more text?'
for match in re.finditer('<form>(.*?)</form>', line, re.S):
    print(match.group(1))
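One reason to prefer `re.finditer` is that each item is a full match object, so position information is available alongside the captured text. A minimal sketch, reusing the sample string from the question (the `results` list is just an illustrative name):

```python
import re

line = 'bla bla bla<form>Form 1</form> some text...<form>Form 2</form> more text?'

results = []
for match in re.finditer(r'<form>(.*?)</form>', line, re.DOTALL):
    # group(1) is the captured text; span(1) is its (start, end) offset in line.
    results.append((match.group(1), match.span(1)))
    print(match.group(1), match.span(1))

# Each span slices the original string back to the captured text.
for text, (start, end) in results:
    print(line[start:end] == text)  # True
```

If only the strings are needed, `re.findall` is shorter; the iterator pays off when offsets or several groups have to be handled per match.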

answered Oct 11 '11 at 12:34 by Aamir Rind

what does the re.S do? – Charlie Parker Feb 21 '14 at 03:16
`re.finditer` is exactly what I needed! Thanks! – shellbye Apr 25 '16 at 07:06
@Pinocchio docs say: re.S is the same as re.DOTALL ``Make the '.' special character match any character at all, including a newline; without this flag, '.' will match anything except a newline.`` (posted this because I believe people like me often come to stackoverflow.com to find answers quickly) – Anton Jun 08 '17 at 11:25
I'm not sure why this isn't the accepted answer - `finditer` is correct. `re.search()` returns a match object. `re.finditer()` returns an iterator over all the found match objects (wrap it in `list()` to collect them all at once). `findall()` returns just a list of strings (or tuples, when the pattern has more than one group). Naming `finditer` like `findall` was a poor choice - it matches more with `search` and should have been called `searchiter`. – mattmc3 Aug 08 '22 at 15:29
Yep, agree with matt. This is the best answer. The `finditer()` [match objects](https://docs.python.org/3/library/re.html#match-objects) are more powerful than the list `findall()` returns because they let you handle each match group explicitly and get other metadata about the match(es). – Andrew Jun 22 '23 at 17:27

score 6 · Answer 3 · answered Oct 11 '11 at 11:06

Using regexes for this purpose is the wrong approach. Since you are using Python, you have a really awesome library available for extracting parts of HTML documents: BeautifulSoup.
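BeautifulSoup is a third-party package, so as a dependency-free illustration of the same idea (let a real HTML parser find the tags), here is a rough sketch using the standard library's `html.parser` instead. This is a swapped-in technique, not BeautifulSoup itself, and `FormTextCollector` is a hypothetical helper name. It collects only the text inside each form, not nested markup.

```python
from html.parser import HTMLParser


class FormTextCollector(HTMLParser):
    """Hypothetical helper: collects the text found inside each <form> element."""

    def __init__(self):
        super().__init__()
        self.forms = []    # one string per completed <form>
        self._depth = 0    # greater than 0 while inside a <form>
        self._chunks = []  # text pieces of the form currently being read

    def handle_starttag(self, tag, attrs):
        if tag == 'form':
            self._depth += 1

    def handle_endtag(self, tag):
        if tag == 'form' and self._depth:
            self._depth -= 1
            if self._depth == 0:
                self.forms.append(''.join(self._chunks))
                self._chunks = []

    def handle_data(self, data):
        if self._depth:
            self._chunks.append(data)


line = 'bla bla bla<form>Form 1</form> some text...<form>Form 2</form> more text?'
collector = FormTextCollector()
collector.feed(line)
collector.close()
print(collector.forms)  # ['Form 1', 'Form 2']
```

Unlike the literal `<form>` pattern, a parser-based approach also handles form tags that carry attributes such as `action` or `method`.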

answered Oct 11 '11 at 11:06 by ThiefMaster

Oh, I didn't know; I just discovered Python yesterday. :) – Stan Oct 11 '11 at 11:12
