regex match with * not matching text with non-English characters

Question

I am trying to scrape a page that has Hebrew text on it. It contains the following piece of HTML:

<div id="AgeRating">דירוג גיל: ‎12+‎</div>

I just want the 12+ part here (in fact, I only want the '12' part). I am currently doing this with the following regex for other languages:

new Regex(@"<div id=""AgeRating"">.*(\d{1,2})\+</div>", RegexOptions.Compiled);

But I just can't get this to match. I tried all the regex options like RightToLeft, CultureInvariant, Singleline, Multiline, etc., but nothing works. It does work fine with plenty of other languages though.

Note: I'm aware of HtmlAgilityPack for proper parsing of HTML. This is a question about why a seemingly correct regex fails to match a particular string (this is the sample I currently have).

I have edited your title. Please see, "[Should questions include “tags” in their titles?](http://meta.stackexchange.com/questions/19190/)", where the consensus is "no, they should not". — John Saunders, Feb 24 '13 at 02:46
This question turned out not to be about "Regex to match Hebrew". I've edited it to remove Hebrew from the title, as it is not actually about matching Hebrew text, and inlined the HtmlAgilityPack comment. All old comments can be removed. (Leon Cullens, feel free to revert/improve my edit, but keep in mind that the question showed up when searching for "regex for Hebrew text" and it really is not about that). — Alexei Levenkov, Sep 26 '14 at 14:39
If you are looking for matching Hebrew - check out http://stackoverflow.com/questions/9197003/regular-expression-with-hebrew-or-english/9242066#9242066 — Alexei Levenkov, Sep 26 '14 at 14:39

Justin O Barber · Accepted Answer · 2013-02-24T13:47:43.883

4

This regular expression works for me:

<div id="AgeRating">.*?(\d{1,2})\+

This returns 12. I added a ? to .* to make the dot not greedy.

I think the thing that is throwing you off is that you have a hidden character (perhaps a Hebrew character?) after the plus sign. The following also works for your string (notice the dot after the plus sign, which accommodates your hidden character):

<div id="AgeRating">.*?(\d{1,2})\+.</div>
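One way to check the hidden-character theory is to print the code points that sit between the plus sign and the closing tag. The sketch below is only an illustration: it assumes the page wraps "12+" in U+200E (LEFT-TO-RIGHT MARK), which right-to-left pages often do around left-to-right runs, but your actual input may contain something else. The variable names are made up for the example, and the \p{Cf} pattern is a suggested variation, not the pattern from the answer.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

// Assumed sample: U+200E (LEFT-TO-RIGHT MARK) before "12" and after "+".
string html = "<div id=\"AgeRating\">דירוג גיל: \u200E12+\u200E</div>";

// Collect the code point of every character between "+" and "</div>".
int plus = html.IndexOf('+');
int close = html.IndexOf("</div>", StringComparison.Ordinal);
string hidden = html.Substring(plus + 1, close - plus - 1);
string codePoints = string.Join(" ", hidden.Select(c => $"U+{(int)c:X4}"));
Console.WriteLine(codePoints); // prints "U+200E" for this sample

// The pattern from the question needs "+" to be followed directly by "</div>".
bool originalMatches = Regex.IsMatch(html, @"<div id=""AgeRating"">.*(\d{1,2})\+</div>");

// Allowing optional format characters (Unicode category Cf) after "+" lets it match.
Match m = Regex.Match(html, @"<div id=""AgeRating"">.*?(\d{1,2})\+\p{Cf}*</div>");
string age = m.Groups[1].Value;
Console.WriteLine($"{originalMatches} {age}"); // prints "False 12"
```

Compared with a bare dot after the plus sign, \p{Cf}* also matches when no hidden character is present, so the same pattern should keep working for the other languages.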

You also do need the ? after .* as I mentioned above in order to prevent the regular expression from returning 2 instead of 12.
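The difference between the greedy and the lazy quantifier can be seen in a small side-by-side run. This is a minimal sketch that assumes a simplified English sample string without any hidden characters, so that only the quantifier behavior is on display.

```csharp
using System;
using System.Text.RegularExpressions;

// Simplified sample without hidden characters (an assumption for illustration).
string html = "<div id=\"AgeRating\">Age rating: 12+</div>";

// Greedy: ".*" consumes as much as it can, leaving only one digit for the group.
string greedy = Regex.Match(html, @"<div id=""AgeRating"">.*(\d{1,2})\+").Groups[1].Value;

// Lazy: ".*?" consumes as little as it can, so the group captures both digits.
string lazy = Regex.Match(html, @"<div id=""AgeRating"">.*?(\d{1,2})\+").Groups[1].Value;

Console.WriteLine($"greedy={greedy}, lazy={lazy}"); // prints "greedy=2, lazy=12"
```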

edited Feb 24 '13 at 13:47

answered Feb 24 '13 at 02:41

Justin O Barber


Do you know why my solution works for most languages but not Hebrew and yours does? – Leon Cullens Feb 24 '13 at 13:23
