How to make regex with unicode symbols?

Question

I need to write a regex that captures the following:

Fixed unicode text:
<br>
<strong>
   text I am looking for
</strong>

Currently I do something like this:

regex = re.compile(unicode('Fixed unicode text:.*','utf-8'))

How can I modify it to capture the remaining text?

score 0 · Accepted Answer · edited May 23 '17 at 11:48


Prefix the string literal with u to get a unicode string (this is needed in Python 2.x; in Python 3 every str is already unicode), and use parentheses to capture the remaining text, like this:

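import re
# u'' literals work on Python 2 and on Python 3.3+; the ur'' prefix is a syntax error on Python 3
haystack = u'Fixed unicode text:\n<br><strong>\ntext I\nam looking for</strong>'
# re.DOTALL lets . match newlines, so the group spans every remaining line
match = re.search(u'Fixed unicode text:(.*)', haystack, re.DOTALL)
print(match.group(1))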
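
To see why the re.DOTALL flag matters here, the following is a small sketch, assuming Python 3 and the same haystack as above (the variable names are only illustrative). It runs the same pattern with no flag, with re.MULTILINE, and with re.DOTALL:

```python
import re

haystack = 'Fixed unicode text:\n<br><strong>\ntext I\nam looking for</strong>'
pattern = r'Fixed unicode text:(.*)'

# Without DOTALL, . stops at the first newline, so the group is empty here.
without_dotall = re.search(pattern, haystack).group(1)

# MULTILINE only changes ^ and $, which this pattern does not use.
with_multiline = re.search(pattern, haystack, re.MULTILINE).group(1)

# DOTALL lets . cross newlines, so the group holds the rest of the text.
with_dotall = re.search(pattern, haystack, re.DOTALL).group(1)

print(repr(without_dotall))
print(repr(with_multiline))
print(repr(with_dotall))
```

Only the last call captures the text that follows the first line break.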

However, it looks like your input is HTML. If that's the case, you should not use a regular expression, but parse the HTML with lxml, BeautifulSoup, or another HTML parser.
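
As a rough sketch of the parser route, the example below uses the standard library's html.parser as a stand-in so that it needs no third-party install; lxml and BeautifulSoup have their own, different APIs. The class StrongAfterLabel and the function text_after_label are hypothetical names made up for this illustration, and the sketch assumes the label arrives in a single text chunk.

```python
from html.parser import HTMLParser


class StrongAfterLabel(HTMLParser):
    """Collects the text of the first <strong> element that follows a label."""

    def __init__(self, label):
        super().__init__()
        self.label = label
        self.seen_label = False   # label text has been encountered
        self.in_strong = False    # currently inside the wanted <strong>
        self.done = False         # the wanted <strong> has been closed
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag == 'strong' and self.seen_label and not self.done:
            self.in_strong = True

    def handle_endtag(self, tag):
        if tag == 'strong' and self.in_strong:
            self.in_strong = False
            self.done = True

    def handle_data(self, data):
        if self.in_strong:
            self.parts.append(data)
        elif not self.seen_label and self.label in data:
            self.seen_label = True


def text_after_label(html, label):
    parser = StrongAfterLabel(label)
    parser.feed(html)
    parser.close()
    # Collapse runs of whitespace, including the line breaks inside the tag.
    return ' '.join(''.join(parser.parts).split())


html = 'Fixed unicode text:\n<br>\n<strong>\n   text I am looking for\n</strong>'
print(text_after_label(html, 'Fixed unicode text:'))
```

Because the parser tracks tags rather than characters, line breaks and extra whitespace inside the markup do not affect the result.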

answered Jan 26 '12 at 17:26

phihag

Thanks, phihag. I've tried to add line breaks support here (`re.M | re.S | re.U`), but it doesn't capture the whole text. What can be the reason? (see updated question) And, I use BeautifulSoup. I need this regex to be used with BeautifulSoup ;). – LA_ Jan 26 '12 at 17:37
@LA_ Neither MULTILINE(M) nor UNICODE(U) have an effect for your regexp. Specifying `DOTALL`(S) is enough. I updated the answer to include that. – phihag Jan 26 '12 at 17:41
hmm, looks like there is some problem with BeautifulSoup - still it captures just one line. http://stackoverflow.com/questions/9007653/how-to-find-tag-with-particular-text-with-beautiful-soup – LA_ Jan 26 '12 at 17:46
Umm, if you're using an HTML parser, you use its traversal and selection methods, namely XPath, not regular expressions. In any case, can you post the code (including the input - use `\n` to encode newlines) that fails, and explain how it fails? – phihag Jan 26 '12 at 17:49
I've updated original question - http://stackoverflow.com/questions/9007653/how-to-find-tag-with-particular-text-with-beautiful-soup – LA_ Jan 26 '12 at 17:57
