Split a string that contains English and Hebrew in C#

Question

I have this string:

string str = "לא קיימת תוכנה לשליחת מיילים במכשיר, אנא פנה אלינו ישירות ל moshecohen@gmail.com";

and I'm trying to split it the following way:

split[0] = "לא קיימת תוכנה לשליחת מיילים במכשיר, אנא פנה אלינו ישירות ל "
split[1] = "moshecohen@gmail.com"

I'm using this split method:

string[] split =  Regex.Split(str, @"^[א-ת]+$");

I want to split between Hebrew and English words, but if the current word is in the same language as the previous one, it should be appended to the previous part.

But I cannot make it work. What am I doing wrong?

Thanks

"I can not make it work" - what do you get instead of the expected outcome? — Vladi Pavelka, Sep 03 '18 at 09:17
What is the rule? Split a string with whitespaces before an email? — Wiktor Stribiżew, Sep 03 '18 at 09:18
The pattern specifies the *splitter*. Your code asks for strings that are separated by any Hebrew character, but only if the *entire* string is in Hebrew. That's self-contradictory. Perhaps you want to split between the last Hebrew and the first Latin character? — Panagiotis Kanavos, Sep 03 '18 at 09:19
Possible duplicate of [extract all email address from a text using c#](https://stackoverflow.com/questions/2333835/extract-all-email-address-from-a-text-using-c-sharp) — Paolo, Sep 03 '18 at 09:23
sorry for not being clear. I want to split between Hebrew and English words, but if the last word is the same as the current add it to the last. — SilverCrow, Sep 03 '18 at 09:26
@SilverCrow Do not add this in comments, add all the details to the question. Modify title to reflect what you really need. — Wiktor Stribiżew, Sep 03 '18 at 09:27

Vladi Pavelka · Answer 1 · 2018-09-03T09:30:20.967

1

Try this:

string[] split = Regex.Split(str, @"(?<=[א-ת]+) (?=[A-z]+)");

(?<=...) is a lookbehind: it asserts what immediately PRECEDES the current position.

(?=...) is a lookahead: it asserts what immediately FOLLOWS the current position.

Together they make the "splitter" the single space that sits between a Hebrew character and a Latin character.
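
A minimal sketch of why the lookarounds matter: they are zero-width, so the only text the pattern consumes is the space itself. The class and method names (LookaroundDemo, ConsumedText) are illustrative and not part of the answer.

```csharp
using System;
using System.Text.RegularExpressions;

public static class LookaroundDemo
{
    // The pattern from the answer: a space preceded by Hebrew letters and followed by [A-z].
    public const string Pattern = @"(?<=[א-ת]+) (?=[A-z]+)";

    // Returns the text the pattern actually consumes at its first match.
    public static string ConsumedText(string input)
    {
        return Regex.Match(input, Pattern).Value;
    }

    public static void Main()
    {
        string str = "לא קיימת תוכנה לשליחת מיילים במכשיר, אנא פנה אלינו ישירות ל moshecohen@gmail.com";

        // Only the single space is consumed; the letters around it are merely inspected.
        Console.WriteLine("[" + ConsumedText(str) + "]"); // prints "[ ]"

        // Because nothing else is consumed, both neighbouring letters survive the split.
        string[] parts = Regex.Split(str, Pattern);
        Console.WriteLine(parts.Length); // 2
        Console.WriteLine(parts[1]);     // moshecohen@gmail.com
    }
}
```

Note that the space itself is the delimiter, so it does not appear in either part.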

edited Sep 03 '18 at 09:30

answered Sep 03 '18 at 09:28

Vladi Pavelka



This `(?<=[א-ת]+) (?=[A-z]+)` won't work in many cases (e.g. when a Hebrew letter follows the ASCII one, or when there is more than one space between the words), and note that [`[A-z]` does not match only letters](http://stackoverflow.com/a/29771926/3832970). – Wiktor Stribiżew Sep 03 '18 at 09:30
@WiktorStribiżew could you please provide an example of a "[hebrew] [e-mail address]" string where what I wrote wouldn't work? – Vladi Pavelka Sep 03 '18 at 09:32
Sure, he can fine-tune the [A-z] part to only match an e-mail and not more – Vladi Pavelka Sep 03 '18 at 09:33
You could use named blocks to match Hebrew and non-Hebrew characters, e.g. `(?<=\p{IsHebrew}) (?=\P{IsHebrew})` – Panagiotis Kanavos Sep 03 '18 at 09:48
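
The caveats raised in these comments can be checked with a short sketch. The Tolerant pattern below is a hypothetical variant written for illustration (a named block, an explicit letter range, and any run of whitespace); it is not something proposed in the thread.

```csharp
using System;
using System.Text.RegularExpressions;

public static class RangeCaveats
{
    // The pattern discussed in the comments.
    public const string Original = @"(?<=[א-ת]+) (?=[A-z]+)";

    // Hypothetical, more tolerant variant (an assumption, not from the thread).
    public const string Tolerant = @"(?<=\p{IsHebrew})\s+(?=[A-Za-z])";

    // Number of pieces Regex.Split produces for the given pattern.
    public static int CountParts(string input, string pattern)
    {
        return Regex.Split(input, pattern).Length;
    }

    public static void Main()
    {
        // [A-z] spans code points 65..122, which also covers [ \ ] ^ _ and `.
        Console.WriteLine(Regex.IsMatch("_", "^[A-z]$"));    // True
        Console.WriteLine(Regex.IsMatch("_", "^[A-Za-z]$")); // False

        // Two spaces between the words: neither space satisfies both lookarounds.
        string twoSpaces = "שלום" + new string(' ', 2) + "hello";
        Console.WriteLine(CountParts(twoSpaces, Original)); // 1
        Console.WriteLine(CountParts(twoSpaces, Tolerant)); // 2
    }
}
```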

Kobi · Answer 2 · 2018-09-03T10:36:41.377

Here's one approach:

[\p{IsHebrew}\P{L}]+|\P{IsHebrew}+

Use this pattern with Regex.Matches:

var matches = Regex.Matches(input, @"[\p{IsHebrew}\P{L}]+|\P{IsHebrew}+");

The pattern has two parts. It either matches:

[\p{IsHebrew}\P{L}]+ - a block containing Hebrew characters and non-letters,

OR

\P{IsHebrew}+ - a block of non-Hebrew characters (including non-Hebrew letters and other non-letter characters).

We're using Unicode Named Blocks like \p{IsHebrew} and \p{IsBasicLatin}.

A similar option is [\p{IsHebrew}\P{L}]+|[\p{IsBasicLatin}\P{L}]+ - it specifically matches a block with Latin (English) letters.

Working example: regex storm, C# example
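
As a rough, self-contained sketch of the Regex.Matches approach above (the BlockMatcher class and SplitIntoBlocks helper are names made up for this example):

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class BlockMatcher
{
    // Either a run of Hebrew characters and non-letters, or a run of non-Hebrew characters.
    public const string Pattern = @"[\p{IsHebrew}\P{L}]+|\P{IsHebrew}+";

    // Collects every match as one block, in order of appearance.
    public static string[] SplitIntoBlocks(string input)
    {
        return Regex.Matches(input, Pattern)
                    .Cast<Match>()
                    .Select(m => m.Value)
                    .ToArray();
    }

    public static void Main()
    {
        string input = "לא קיימת תוכנה לשליחת מיילים במכשיר, אנא פנה אלינו ישירות ל moshecohen@gmail.com";

        // Brackets make the whitespace kept at the end of a block visible.
        foreach (string block in SplitIntoBlocks(input))
        {
            Console.WriteLine("[" + block + "]");
        }
    }
}
```

Because spaces and punctuation are non-letters, they stay attached to whichever block reaches them first, so a block may end with a space; call Trim() on each block if that is unwanted.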

score 0 · Answer 3 · answered Sep 03 '18 at 09:20

Why not look at it differently? The real question here is how to extract the email addresses from the text.

There are a lot of posts on that question.

For example, this:

// Requires: using System; using System.Text.RegularExpressions;
public static void PrintEmails(string text)
{
    // Local part, then '@', then either a dotted-quad IP address or a domain name.
    const string MatchEmailPattern =
        @"(([\w-]+\.)+[\w-]+|([a-zA-Z]{1}|[\w-]{2,}))@"
        + @"((([0-1]?[0-9]{1,2}|25[0-5]|2[0-4][0-9])\.([0-1]?[0-9]{1,2}|25[0-5]|2[0-4][0-9])\."
        + @"([0-1]?[0-9]{1,2}|25[0-5]|2[0-4][0-9])\.([0-1]?[0-9]{1,2}|25[0-5]|2[0-4][0-9])){1}|"
        + @"([a-zA-Z]+[\w-]+\.)+[a-zA-Z]{2,4})";
    Regex rx = new Regex(MatchEmailPattern, RegexOptions.Compiled | RegexOptions.IgnoreCase);

    // Find all matches and print each one on its own line.
    foreach (Match match in rx.Matches(text))
    {
        Console.WriteLine(match.Value);
    }
}

Please, have a look at these sites: TLD list: https://www.iana.org/domains/root/db ; valid/invalid addresses: https://en.wikipedia.org/wiki/Email_address#Examples ; regex for RFC822 email address: http://www.ex-parrot.com/~pdw/Mail-RFC822-Address.html — Toto, Sep 03 '18 at 09:25

score 0 · Answer 4 · answered Sep 03 '18 at 09:28


Judging by your input string, we can assume it consists of Hebrew text followed by an email address at the end of the string.

Then the regex can be (just an example):

\w*@gmail\.com$

You can test the regex here: https://regexr.com/
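
One way this idea could be turned into a split, sketched under the assumption that the address always sits at the very end of the string and uses the gmail.com domain from the example. TrailingEmail and SplitAtEmail are hypothetical names.

```csharp
using System;
using System.Text.RegularExpressions;

public static class TrailingEmail
{
    // Dots are escaped so they match literally; the domain is taken from the example.
    public const string Pattern = @"\w*@gmail\.com$";

    // Returns { text before the address, the address }.
    // If no address is found, the whole input is returned as the first element.
    public static string[] SplitAtEmail(string input)
    {
        Match m = Regex.Match(input, Pattern);
        if (!m.Success)
        {
            return new[] { input, string.Empty };
        }
        return new[] { input.Substring(0, m.Index), m.Value };
    }

    public static void Main()
    {
        string str = "לא קיימת תוכנה לשליחת מיילים במכשיר, אנא פנה אלינו ישירות ל moshecohen@gmail.com";
        string[] parts = SplitAtEmail(str);
        Console.WriteLine("[" + parts[0] + "]");
        Console.WriteLine("[" + parts[1] + "]");
    }
}
```

Since \w does not match a dot or a hyphen, a local part such as first.last would only be captured partially; the pattern would need widening for such addresses.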

answered Sep 03 '18 at 09:28

Nhan Phan


score 0 · Answer 5 · answered Sep 03 '18 at 09:40

The pattern passed to Regex.Split matches the delimiter, and the delimiter isn't included in the results. It looks like you want to split between the last Hebrew and the first non-Hebrew character, e.g.:

Regex.Split(str,@"\p{IsHebrew} \P{IsHebrew}")

\p{} matches a character that belongs to a specific Unicode general category or named block, while \P{} matches any character that does not.

Unfortunately, this pattern consumes the last Hebrew character and the first non-Hebrew character along with the space, so they are missing from the results:

לא קיימת תוכנה לשליחת מיילים במכשיר, אנא פנה אלינו ישירות   
oshecohen@gmail.com

Capture groups can be used to include the characters matched by a delimiter pattern in the results. Simply wrapping each side in a group, as in (\p{IsHebrew}) (\P{IsHebrew}), returns each captured group as a separate result, though:

לא קיימת תוכנה לשליחת מיילים במכשיר, אנא פנה אלינו ישירות  
ל 
m 
oshecohen@gmail.com

Vladi Pavelka's use of lookbehind and lookahead assertions fixes this, and (?<=\p{IsHebrew}) (?=\P{IsHebrew}) will return the expected results:

Regex.Split(str,@"(?<=\p{IsHebrew}) (?=\P{IsHebrew})")

will return :

לא קיימת תוכנה לשליחת מיילים במכשיר, אנא פנה אלינו ישירות ל 
moshecohen@gmail.com
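
The three variants discussed in this answer can be compared side by side with a small sketch. The SplitVariants class and its Show helper are illustrative names, not part of any library.

```csharp
using System;
using System.Text.RegularExpressions;

public static class SplitVariants
{
    public const string Consuming  = @"\p{IsHebrew} \P{IsHebrew}";
    public const string Capturing  = @"(\p{IsHebrew}) (\P{IsHebrew})";
    public const string Lookaround = @"(?<=\p{IsHebrew}) (?=\P{IsHebrew})";

    // Prints every piece returned by Regex.Split, bracketed so whitespace is visible.
    public static void Show(string label, string input, string pattern)
    {
        string[] parts = Regex.Split(input, pattern);
        Console.WriteLine(label + ": " + parts.Length + " part(s)");
        foreach (string part in parts)
        {
            Console.WriteLine("  [" + part + "]");
        }
    }

    public static void Main()
    {
        string str = "לא קיימת תוכנה לשליחת מיילים במכשיר, אנא פנה אלינו ישירות ל moshecohen@gmail.com";

        Show("consuming", str, Consuming);   // 2 parts, both boundary letters lost
        Show("capturing", str, Capturing);   // 4 parts, boundary letters returned separately
        Show("lookaround", str, Lookaround); // 2 parts, nothing but the space removed
    }
}
```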

score 0 · Answer 6 · answered Sep 03 '18 at 10:30

Why not simply use \p{IsHebrew}?

Something like this:

 string str = "לא קיימת תוכנה לשליחת מיילים במכשיר, אנא פנה אלינו ישירות ל moshecohen@gmail.com";
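 // Requires: using System.Linq; using System.Text.RegularExpressions;
 string pattern = @"\p{IsHebrew}+";
 var hebrewMatchCollection = Regex.Matches(str, pattern);
 // Join the Hebrew runs with spaces; punctuation between the runs (such as the comma) is dropped.
 string hebrewPart = string.Join(" ", hebrewMatchCollection.Cast<Match>().Select(m => m.Value));
 // Take the text after the last Hebrew run; Trim() removes the leading space.
 var englishPart = Regex.Split(str, pattern).Last().Trim();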
