
If the text was

<textarea> xyz asdf qwr </textarea>

I'm trying to write a regular expression that extracts the text between the two tags (here, xyz asdf qwr).

So far I have reached [(<textarea)][</textarea>)] which will capture the tags but I haven't been able to actually capture the text in between the two tags.

I also tried [(<textarea)]+.[</textarea>)] and even [[(<textarea)]+.[</textarea>)], but neither gives the right results.

Can someone please shed some light on this or share some links that will help me reach a solution?

Ashwin

3 Answers


Is there a particular reason that you must use a regular expression to parse what looks like HTML? I wouldn't do it. See "RegEx match open tags except XHTML self-contained tags" for the best explanation.

This becomes really simple if you use the BeautifulSoup module, which is going to be far better at parsing HTML (especially if it is messy HTML).

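import bs4

# Parse the file with Python's built-in HTML parser.
with open("test.html") as f:
    soup = bs4.BeautifulSoup(f, "html.parser")

# Print the text content of every <textarea> element.
for textarea in soup.find_all("textarea"):
    print(textarea.get_text())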
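
If installing BeautifulSoup is not an option, the same idea can be sketched with the standard library's html.parser module. This is only a rough illustration, not part of the answer above: the class name TextareaExtractor and the function extract_textareas are hypothetical, and the sketch assumes textareas are not nested.

```python
from html.parser import HTMLParser


class TextareaExtractor(HTMLParser):
    """Collects the text inside every <textarea> element (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.inside = False   # True while between <textarea> and </textarea>
        self.parts = []       # text chunks of the textarea currently open
        self.contents = []    # one finished string per textarea

    def handle_starttag(self, tag, attrs):
        if tag == "textarea":
            self.inside = True
            self.parts = []

    def handle_endtag(self, tag):
        if tag == "textarea" and self.inside:
            self.contents.append("".join(self.parts))
            self.inside = False

    def handle_data(self, data):
        if self.inside:
            self.parts.append(data)


def extract_textareas(html):
    """Return a list with the text of each <textarea> found in html."""
    parser = TextareaExtractor()
    parser.feed(html)
    parser.close()
    return parser.contents


print(extract_textareas("<textarea> xyz asdf qwr </textarea>"))
```

For messy real-world HTML, BeautifulSoup is likely to be the more forgiving choice; the sketch just shows that a parser-based approach needs no regex at all.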
mdadm

You shouldn't parse HTML with regex - parse it with an HTML parser! See this answer.

That being said, if you must use a regex:

The square brackets [] mean "match any single character inside", so [<(textarea)] means "match one of <, (, t, e, x, a, r, or )".
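
A quick way to see this is to run the character class against the sample text. The snippet below is a minimal sketch; the variable names are made up for illustration.

```python
import re

text = "<textarea> xyz asdf qwr </textarea>"

# A character class matches exactly ONE character from the set,
# so this consumes only the leading "<".
char_class = re.compile(r"[<(textarea)]")
first = char_class.match(text)
print(first.group(0))

# To match the tag itself, write it literally, outside square brackets.
literal = re.compile(r"<textarea>")
whole_tag = literal.match(text)
print(whole_tag.group(0))
```

The first print shows a single character, while the second shows the whole opening tag.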

You probably want <textarea>(.*?)</textarea>, with group 1 (the first set of parentheses) capturing the contents of the tag.

This will have problems if, for example, the user writes "</textarea>" inside the text area: only the text up to the first occurrence of "</textarea>" will be extracted. However, if you make it greedy and use <textarea>.*</textarea>, then with multiple textarea tags the .* will match across all of them instead of matching each individually. Such are the pitfalls of using regex with HTML.
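
The difference between the two quantifiers can be sketched as follows. The sample string is invented for illustration, and the re.DOTALL flag is an added assumption (it lets . match newlines, in case the textarea content spans several lines).

```python
import re

html = "<textarea>first</textarea> <textarea>second</textarea>"

# Non-greedy: .*? stops at the first closing tag it can reach.
lazy = re.findall(r"<textarea>(.*?)</textarea>", html, flags=re.DOTALL)

# Greedy: .* runs to the LAST closing tag, swallowing both textareas.
greedy = re.findall(r"<textarea>(.*)</textarea>", html, flags=re.DOTALL)

print(lazy)    # ['first', 'second']
print(greedy)  # ['first</textarea> <textarea>second']
```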

mathematical.coffee

I think you were struggling to understand that the "+" and "*" operators apply to the pattern immediately before them, not the pattern after them.

>>> import re
>>> target = "<textarea> xyz asdf qwr </textarea>"
>>> re.match(r"\<textarea\>.*\<textarea/\>", target)  # wrong closing tag: no match, returns None
>>> re.match(r"\<textarea\>.*\</textarea>", target)
<_sre.SRE_Match object at 0x106528b90>
>>> mo = re.match(r"\<textarea\>.*\</textarea>", target)
>>> mo.groups()
()
>>> mo.group(0)
'<textarea> xyz asdf qwr </textarea>'
>>> mo = re.match(r"\<textarea\>(.*)\</textarea>", target)
>>> mo.groups()
(' xyz asdf qwr ',)
>>> mo.group(0)
'<textarea> xyz asdf qwr </textarea>'
>>> mo.group(1)
' xyz asdf qwr '
>>>
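
To make the point about "+" and "*" concrete, here is a small sketch (the patterns and variable names are made up for illustration) showing that a quantifier repeats only the element directly before it unless parentheses group more:

```python
import re

# "+" binds to the single element before it: here only the "b".
b_repeated = bool(re.fullmatch(r"ab+", "abbb"))
ab_not_repeated = bool(re.fullmatch(r"ab+", "abab"))

# Parentheses make the whole group the repeated element.
group_repeated = bool(re.fullmatch(r"(ab)+", "abab"))

print(b_repeated, ab_not_repeated, group_repeated)  # True False True
```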

Does that help?

holdenweb
  • Also follow the good advice about a) being wary of greedy matching, and b) considering the use of a suitable HTML parser. – holdenweb Mar 24 '14 at 23:38