
I need to write a regex that captures the following:

Fixed unicode text:
<br>
<strong>
   text I am looking for
</strong>

I do something like

regex = re.compile(unicode('Fixed unicode text:.*','utf-8'))

How can I modify it to capture the remaining text?

LA_

1 Answer


Prefix the string literal with u (in Python 2.x; no prefix is needed in Python 3) to get a unicode string, use parentheses to capture the remaining text, and pass re.DOTALL so that . also matches newlines, like this:

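import re

# Sample input. The u prefix is required in Python 2 and accepted (but redundant) in Python 3.3+.
haystack = u'Fixed unicode text:\n<br><strong>\ntext I\nam looking for</strong>'
# re.DOTALL lets . match newlines, so group 1 spans all remaining lines.
match = re.search(u'Fixed unicode text:(.*)', haystack, re.DOTALL)
print(match.group(1))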

However, it looks like your input is HTML. If that's the case, you should not use a regular expression, but parse the HTML with lxml, BeautifulSoup, or another HTML parser.
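
As a rough illustration of the parser-based approach, here is a minimal sketch that uses the standard library's `html.parser` rather than lxml or BeautifulSoup, so it runs without extra installs (Python 3). The class name `StrongAfterMarker` and the helper `text_after_marker` are made up for this example, and it assumes the wanted text sits in the first `<strong>` element after the marker string, as in the question's sample.

```python
from html.parser import HTMLParser


class StrongAfterMarker(HTMLParser):
    """Collect the text of the first <strong> element that follows a marker string."""

    def __init__(self, marker):
        super().__init__()
        self.marker = marker
        self.seen_marker = False  # True once the marker text has been read
        self.in_strong = False    # True while inside the target <strong>
        self.done = False         # True once the target </strong> has closed
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag == "strong" and self.seen_marker and not self.done:
            self.in_strong = True

    def handle_endtag(self, tag):
        if tag == "strong" and self.in_strong:
            self.in_strong = False
            self.done = True

    def handle_data(self, data):
        if self.in_strong:
            self.parts.append(data)
        elif self.marker in data:
            self.seen_marker = True


def text_after_marker(html, marker):
    parser = StrongAfterMarker(marker)
    parser.feed(html)
    parser.close()
    # Collapse runs of whitespace (including newlines) into single spaces.
    return " ".join("".join(parser.parts).split())


sample = "Fixed unicode text:\n<br>\n<strong>\n   text I am looking for\n</strong>"
print(text_after_marker(sample, "Fixed unicode text:"))  # text I am looking for
```

Because the parser tracks tags rather than characters, line breaks and extra whitespace inside the markup do not affect the result, which is the main reason to prefer it over a regex here.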

phihag
  • Thanks, phihag. I've tried to add line breaks support here (`re.M | re.S | re.U`), but it doesn't capture the whole text. What can be the reason? (see updated question) And, I use BeautifulSoup. I need this regex to be used with BeautifulSoup ;). – LA_ Jan 26 '12 at 17:37
  • @LA_ Neither MULTILINE (M) nor UNICODE (U) has any effect on your regexp. Specifying `DOTALL` (S) is enough. I updated the answer to include that. – phihag Jan 26 '12 at 17:41
  • hmm, looks like there is some problem with BeautifulSoup - still it captures just one line. http://stackoverflow.com/questions/9007653/how-to-find-tag-with-particular-text-with-beautiful-soup – LA_ Jan 26 '12 at 17:46
  • Umm, if you're using an HTML parser, you use its traversal and selection methods, namely XPath, not regular expressions. In any case, can you post the code (including the input - use `\n` to encode newlines) that fails, and explain how it fails? – phihag Jan 26 '12 at 17:49
  • I've updated the original question - http://stackoverflow.com/questions/9007653/how-to-find-tag-with-particular-text-with-beautiful-soup – LA_ Jan 26 '12 at 17:57
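
The point about flags made in the comments can be checked with a small sketch (Python 3 syntax; the variable names are only illustrative). It runs the same pattern with no flags, with `re.M | re.U`, and with `re.S`, and prints what group 1 captures in each case.

```python
import re

haystack = "Fixed unicode text:\n<br><strong>\ntext I\nam looking for</strong>"
pattern = r"Fixed unicode text:(.*)"

# Without flags, . stops at the first newline, so group 1 is empty here.
no_flags = re.search(pattern, haystack).group(1)

# MULTILINE only changes ^ and $; UNICODE only changes classes such as \w and \b.
# The pattern uses none of these, so the result is the same as with no flags.
multiline_unicode = re.search(pattern, haystack, re.M | re.U).group(1)

# DOTALL lets . match newlines, so group 1 runs to the end of the string.
dotall = re.search(pattern, haystack, re.S).group(1)

print(repr(no_flags))           # ''
print(repr(multiline_unicode))  # ''
print(repr(dotall))             # '\n<br><strong>\ntext I\nam looking for</strong>'
```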