Regex to recognize Hebrew unicode characters or just Hebrew characters

Question

I am trying to figure out a regular expression to use with the Flex regex engine in C++, so that I can parse a construct from my programming language, whose keywords are in Hebrew. One of the constructs/patterns the regex needs to recognize is:

קו

Regexes I've tried:
"קו"
(קו)
[\u05E7\u05D5]
[\u05D5]{1}[\u05E7]{1}
[^\b\u05D5][\u05E7\b]

The first one worked, but then my other regex pattern recognized it too, which I don't want. That other pattern is:

`[קראטוןםפשדגכעיחלךףזסבהנמצתץ]+`

I also tried using Unicode escapes for the above pattern, as shown below, but it did not work:
[\u05D0-\u05EA]+

Ideally, I want my regex pattern to be able to match the following string combo or the one below it
קו אחד = שלום קו אחד

For the above, I tried these regex patterns, but none worked:
(קו)(\s)[קראטוןםפשדגכעיחלךףזסבהנמצתץ]+
(וק)\s+[קראטוןםפשדגכעיחלךףזסבהנמצתץ]+
[קראטוןםפשדגכעיחלךףזסבהנמצתץ]+\s+(וק)

Ideally, I'd like to use the Unicode escapes in all my regular expressions.

Also, this is the table that I've been using for the unicode characters: this link

Moreover, I have looked at the questions below and tried the posted solutions, but none of them worked. I only want to match the Hebrew letters that don't have dots, which are only the Unicode characters U+05D0-U+05EA, whereas these questions cover the Unicode characters with the dot system. Regardless, I can't get replacing the dotted Unicode characters with the non-dotted ones to work:
tried all solutions here
read through this, tried solution, no success
and this is for PHP, so not very helpful as I'm using C++

I think you should be able to use the syntax `\p{Hebrew}`, to indicate the Unicode-script category property for Hebrew. See https://www.regular-expressions.info/unicode.html#script. I’m not familiar enough with regular-expression handling in PHP to know if you need to wrap some additional syntax around that, or use some particular PHP flags to indicate it. But it’s my understanding that PHP’s regular-expression engine is PCRE-conformant, and all PCRE-conformant engines support specifying the Unicode-script category properties defined at https://www.regular-expressions.info/unicode.html#script. — sideshowbarker, Mar 10 '20 at 17:36
See also the http://www.unicode.org/Public/UCD/latest/ucd/Scripts.txt resource, showing the Unicode code-point ranges (and individual code points in between) that are referenced by the `\p{Hebrew}` property — which appear to be a total of 134 code points: 0591..05BD, 05BE, 05BF, 05C0, 05C1..05C2, 05C3, 05C4..05C5, 05C6, 05C7, 05D0..05EA, 05EF..05F2, 05F3..05F4, FB1D, FB1E, FB1F..FB28, FB29, FB2A..FB36, FB38..FB3C, FB3E, FB40..FB41, FB43..FB44, FB46..FB4F — sideshowbarker, Mar 10 '20 at 17:47
@Wiktor Stribiżew C++ regex appears to be different than Javascript regex. — developer01, Mar 11 '20 at 18:02
@slideshowbarker I apologize, I'm not using PHP. I realize the linked question appears to be misleading so I changed the link tag. I am using C++ -- I did encounter this `\p{Hebrew}` in my readings for PHP which is a very nice asset. Do you know if C++ contains anything of the sort? Based on my research, I couldn't find anything indicating so. — developer01, Mar 11 '20 at 18:03
It does not matter since the Unicode units are the same across all these regex engines. `\p{Hebrew}` is not supported by `std::regex`, but you might try some luck with `boost::regex`. — Wiktor Stribiżew, Mar 11 '20 at 20:47

score 4 · Accepted Answer · answered Mar 10 '20 at 17:18

You need to use two ranges of characters:

U+0590-05FF (/*פ,ש*/) and
U+FB1D-FB4F (/*Pres: ﬡ,טּ*/).

So, you can try the regex:

[\u0590-\u05FF\uFB1D-\uFB4F]+
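
If your engine does not accept `\u` escapes, the same two ranges can be checked by hand. The following is only a minimal sketch, assuming the input is UTF-8 encoded; `in_hebrew_ranges` and `is_hebrew_word` are hypothetical helper names, not part of Flex or the standard library. It decodes each code point and accepts the string only if every code point falls inside one of the two ranges above, which is roughly what the regex would do on a full match.

```cpp
#include <cstddef>
#include <string>

// True if cp lies in one of the two ranges from the answer:
// U+0590-U+05FF or U+FB1D-U+FB4F.
bool in_hebrew_ranges(char32_t cp) {
    return (cp >= 0x0590 && cp <= 0x05FF) || (cp >= 0xFB1D && cp <= 0xFB4F);
}

// Hypothetical helper: true if `s` is a non-empty UTF-8 string in which
// every code point is inside the ranges above. Simplified decoder: it
// handles 1- to 3-byte sequences and does not reject overlong encodings.
bool is_hebrew_word(const std::string& s) {
    if (s.empty()) return false;
    std::size_t i = 0;
    while (i < s.size()) {
        unsigned char b0 = static_cast<unsigned char>(s[i]);
        char32_t cp;
        std::size_t len;
        if (b0 < 0x80) { cp = b0; len = 1; }                          // ASCII
        else if ((b0 & 0xE0) == 0xC0) { cp = b0 & 0x1F; len = 2; }    // 2-byte lead
        else if ((b0 & 0xF0) == 0xE0) { cp = b0 & 0x0F; len = 3; }    // 3-byte lead
        else return false;  // 4-byte lead or stray continuation byte
        if (i + len > s.size()) return false;  // truncated sequence
        for (std::size_t k = 1; k < len; ++k) {
            unsigned char b = static_cast<unsigned char>(s[i + k]);
            if ((b & 0xC0) != 0x80) return false;  // not a continuation byte
            cp = (cp << 6) | (b & 0x3F);
        }
        if (!in_hebrew_ranges(cp)) return false;
        i += len;
    }
    return true;
}
```

For example, `is_hebrew_word("\xD7\xA7\xD7\x95")` (the UTF-8 bytes of קו) returns true, while `is_hebrew_word("abc")` returns false. As far as I know, Flex matches bytes rather than code points, so under the same UTF-8 assumption the unpointed letters U+05D0-U+05EA are the byte 0xD7 followed by a byte in 0x90-0xAA, which may be easier to express in a Flex rule than `\u` escapes.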

Paul Vargas

It recognizes "דחא" but then it can't recognize the next line. It appears to be more of a Bison issue. – developer01 Mar 11 '20 at 19:26
