How to split text with a RegEx?

Question

I need to parse text into sentences, but I have a small problem. I use a Regex with this pattern:

@"(?<=[\.!\?\...])\s+"

...to split sentences from text. But when I have text like:

Šios sutarties sąlygos taikomos „Microsoft. Hotmail“, „Microsoft. SkyDrive“, „Microsoft“ abonementui.

I need the regex to keep this as one whole sentence, but it splits it into two sentences instead:

Šios sutarties sąlygos taikomos „Microsoft.
Hotmail“, „Microsoft. SkyDrive“, „Microsoft“ abonementui.

How can I write a regex which ignores the symbols [. ! ? ...] when they appear between „ and “?

This might help. http://stackoverflow.com/questions/5695240/php-regex-to-ignore-escaped-quotes-within-quotes — Reactgular, Sep 21 '13 at 13:57
I expected to get one sentence that starts at [Šios sutarties..] and runs to [...rosoft“ abonementui.], not two sentences :/ — Eimantas Žlabys, Sep 21 '13 at 14:13

Logan · Answer 1 · 2013-09-22T03:23:34.443

This is it.

Here are some details of the regex:

(.*?„.*?“)*? matches zero or more groups of "some words outside „some words inside“";
[^„]*?(((?<!\\b(\\d|[A-Z]))\\.)|[!?]) matches an escaped dot . or ? or !, with no standalone „ before them;
((?<!\\b(\\d|[A-Z]))\\.) is a substring of the one in the previous item; it makes the dot . special: the dot must not be preceded by a single uppercase letter or digit that stands at a word boundary;

Take care with all the *? quantifiers to make sure we are not over-matching.
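To see the dot guard from the last item in isolation, here is a minimal sketch. It assumes the form of the lookbehind used in the program below (with \b applied to both the digit and the capital); the class and method names are hypothetical and only serve the illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class DotGuardDemo
{
    // Hypothetical helper: returns the index of every dot that the guard
    // accepts as a sentence terminator.
    public static List<int> TerminatingDots(string text)
    {
        // A dot is rejected when it directly follows a single digit or
        // capital letter that starts at a word boundary ("1." or "V.").
        var guard = new Regex(@"(?<!\b(\d|[A-Z]))\.");
        var hits = new List<int>();
        foreach (Match m in guard.Matches(text))
            hits.Add(m.Index);
        return hits;
    }

    public static void Main()
    {
        // The dots after "1" and "V" are skipped; the dots after "world"
        // and "102" are kept, because "2" does not start at a word boundary.
        foreach (int i in TerminatingDots("1. V. Hello world. I am in room 102."))
            Console.WriteLine(i); // prints 17, then 35
    }
}
```

Note that [A-Z] covers only ASCII capitals here, so an initial such as "Ž." would not be protected by this guard.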

using System;
using System.Text.RegularExpressions;


namespace RegexTest
{
    class MainClass
    {
        public static void Main(string[] args)
        {
            string[] cases =
            {
                "Šios sutarties sąlygos taikomos „Microsoft. Hotmail“, „Microsoft. SkyDrive“, „Microsoft“ abonementui.",
                "Šios sutarties sąlygos taikomos „Microsoft“. Hotmail, „Microsoft. SkyDrive“, „Microsoft“ abonementui! Ok? more",
                "1. Hello world. And MORE.",
                "V. Hello world. And MORE.",
                "1. V. Hello world. And MORE.",
                "I am in room 102. And you?",
            };

            var re = new Regex("(.*?„.*?“)*?[^„]*?(((?<!\\b(\\d|[A-Z]))\\.)|[!?])");

            foreach (var case_ in cases) {
                foreach (Match m in re.Matches(case_))
                    Console.WriteLine(m);

                Console.WriteLine("------------I am a splitter :) ------------");
            }
        }
    }
}

Output:

    Šios sutarties sąlygos taikomos „Microsoft. Hotmail“, „Microsoft. SkyDrive“, „Microsoft“ abonementui.
    ------------I am a splitter :) ------------
    Šios sutarties sąlygos taikomos „Microsoft“.
    Hotmail, „Microsoft. SkyDrive“, „Microsoft“ abonementui!
    Ok?
    ------------I am a splitter :) ------------
    1. Hello world.
    And MORE.
    ------------I am a splitter :) ------------
    V. Hello world.
    And MORE.
    ------------I am a splitter :) ------------
    1. V. Hello world.
    And MORE.
    ------------I am a splitter :) ------------
    I am in room 102.
    And you?
    ------------I am a splitter :) ------------
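Note that in the second case the trailing "more" is not printed, because it has no terminator and so never becomes a match. If you want a list of sentences that also keeps such a tail, something along these lines might do. This is only a sketch built on the same pattern; SentenceSplitter and Split are hypothetical names, and it assumes the matches are contiguous, which holds for the cases above but is not guaranteed for arbitrary text (for example text with line breaks).

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class SentenceSplitter
{
    // Same pattern as in the program above.
    private static readonly Regex SentenceRe =
        new Regex("(.*?„.*?“)*?[^„]*?(((?<!\\b(\\d|[A-Z]))\\.)|[!?])");

    // Hypothetical helper: returns the trimmed sentences, followed by any
    // unterminated text left after the last match.
    public static List<string> Split(string text)
    {
        var sentences = new List<string>();
        int end = 0;
        foreach (Match m in SentenceRe.Matches(text))
        {
            sentences.Add(m.Value.Trim());
            end = m.Index + m.Length; // remember where the last match stopped
        }

        string tail = text.Substring(end).Trim();
        if (tail.Length > 0)
            sentences.Add(tail);
        return sentences;
    }

    public static void Main()
    {
        var text = "Šios sutarties sąlygos taikomos „Microsoft“. Hotmail, „Microsoft. SkyDrive“, „Microsoft“ abonementui! Ok? more";
        foreach (string s in Split(text))
            Console.WriteLine(s); // three sentences, then "more"
    }
}
```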

Nice, it works!!!!! :) But how do I write it so that it reads not just up to [.] but up to any of [.!? ...]? — Eimantas Žlabys, Sep 21 '13 at 15:40
It's a nice regex and it works, but what about this sentence: 1. V. Adamkus visada daug padedavo saliai. ??? The regex returns: 1. V. Adamkus visada daug padedavo saliai. But I need the whole sentence. Shouldn't the regex not end the sentence if what comes before [.|?|!] is a number or a single uppercase letter? :/ — Eimantas Žlabys, Sep 21 '13 at 16:10
@EimantasŽlabys Thanks, but I may not have understood your comment well. Could you please add the new cases, well formatted, at the end of your question? I'll update my answer later. — Logan, Sep 22 '13 at 03:06
@EimantasŽlabys, Python's regex syntax is almost the same as C#'s: http://docs.python.org/2/library/re.html . You can try it in a REPL, which is more convenient than C#'s Write -> Compile -> Run cycle. Here is a reference from MSDN: http://msdn.microsoft.com/en-us/library/az24scfc.aspx — Logan, Sep 24 '13 at 05:01
@EimantasŽlabys, a language-independent tutorial: http://www.regular-expressions.info/tutorial.html — Logan, Sep 24 '13 at 05:02
Hello, I'm now trying to parse these types of words from a sentence: 1) E.Žlabys 2) E.Ž. 3) Eimantas Žlabys. I have written this regex: @"([A-Ž]{1}\.[A-Ž]{1}\.)|([A-Ž]\s[A-Ž])| ([A-Ž]{1}\.[A-Ž]{1}[a-ž]{1,})"; But it doesn't work with "Eimantas Žlabys" types of words... Where is the mistake in my regex for finding that type of word? :/ — Eimantas Žlabys, Sep 25 '13 at 13:12
@EimantasŽlabys This one may work: "([A-Ž]\\.[A-Ž]\\.)|([A-Ž]+\\s[A-Ž]+)|([A-Ž]{1}\\.[A-Ž]{1}[a-ž])". Just look at "([A-Ž]+\s[A-Ž]+)". You should also notice that the following two regexes are equivalent: "ab[a-z]" and "a{1}b{1}[a-z]{1}". Removing all the "{1}"s makes it simpler. — Logan, Oct 07 '13 at 22:46

score 0 · Answer 2 · edited Jun 20 '20 at 09:12

From my understanding, you want to match any sentence ending in ?, ! or ., or in an ellipsis '...', while ignoring text inside „“. You also do not want a match to end on a single digit or capital letter followed by ?, !, . or ...

In that case, this will work:

([^„]*?(„[^“]+?“)*)+?(?<!\b[\dA-Z])([?!]|[.]{1,3})

Code example:

using System;
using System.Text.RegularExpressions;

public class Program
{
    public static void Main()
    {
        string pattern = @"([^„]*?(„[^“]+?“)*)+?(?<!\b[\dA-Z])([?!]|[.]{1,3})";
        string input = "Šios sutarties sąlygos taikomos „Microsoft. Hotmail“, „Microsoft. SkyDrive“, „Microsoft“ abonementui.";
        var matches = Regex.Matches(input, pattern);
        foreach (Match match in matches)
        {
            // Each match is one sentence; Trim() drops the leading space.
            Console.WriteLine(match.Value.Trim());
        }
    }
}

Output:

Šios sutarties sąlygos taikomos „Microsoft. Hotmail“, „Microsoft. SkyDrive“, „Microsoft“ abonementui.

For input: 1.The „Acme. Photo“ is good. Test string „Microsoft. Hotmail“... Some more text? Even more text! Final text.

Output:

1.The „Acme. Photo“ is good.

Test string „Microsoft. Hotmail“...

Some more text?

Even more text!

Final text.

Explanation of regex: ([^„]*?(„[^“]+?“)*)+?(?<!\b[\dA-Z])([?!]|[.]{1,3})

[^„]*? matches anything that is not '„'. *? means a lazy (non-greedy) match.
(„[^“]+?“)* follows this match with 0 or more instances of quoted text „...“.
+? means match this group 1 or more times, lazily (i.e. everything before !, ?, . or ...).
(?<!\b[\dA-Z]) is a negative lookbehind for a single digit or uppercase letter at a word boundary. Basically, don't match ?, !, . or ... if it is preceded by a lone digit or capital.
([?!]|[.]{1,3}) follows the previous match with ? or ! or 1 to 3 dots (periods).
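The same breakdown can be written into the pattern itself using RegexOptions.IgnorePatternWhitespace, which makes the regex engine ignore unescaped whitespace in the pattern and treat # as the start of a comment. The following is a sketch rather than a tested production version; VerbosePatternDemo and Sentences are hypothetical names.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class VerbosePatternDemo
{
    // Same pattern as above, laid out one piece per line with comments.
    private const string Pattern = @"
        (                   # one chunk of sentence text:
          [^„]*?            #   lazily, characters other than the opening quote
          („[^“]+?“)*       #   then zero or more complete quoted spans
        )+?                 # one or more chunks, as few as possible
        (?<!\b[\dA-Z])      # not directly after a lone digit or capital
        ([?!]|[.]{1,3})     # terminator: question mark, exclamation mark, 1 to 3 dots
    ";

    // Hypothetical helper: returns each matched sentence, trimmed.
    public static List<string> Sentences(string input)
    {
        var result = new List<string>();
        var re = new Regex(Pattern, RegexOptions.IgnorePatternWhitespace);
        foreach (Match m in re.Matches(input))
            result.Add(m.Value.Trim());
        return result;
    }

    public static void Main()
    {
        var input = "1.The „Acme. Photo“ is good. Test string „Microsoft. Hotmail“... Some more text? Even more text! Final text.";
        foreach (string s in Sentences(input))
            Console.WriteLine(s);
    }
}
```

The pattern contains no literal spaces, so switching on this option should not change what it matches; if you later add a space that must match literally, escape it or put it in a character class.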

Normally I would use (?>) for performance, but I thought we would keep the regex simple. This site is very helpful.

Hope that helps.
