Splitting a long Urdu sentence into smaller sentences based on conjunctions in C#

Question

Here is what I have done so far. The problem is that if a conjunction appears twice in a sentence, the code does not handle the second occurrence. Can anyone help?

    private void SplitSentence_Click(object sender, EventArgs e)
    {
        richTextBox2.Text = "";
        richTextBox3.Text = "";
        string[] keywords = { " or ", " and ", " hence", "so that", "however", " because" };
        string[] sentences = SentenceTokenizer(richTextBox1.Text);
        string remSentence;

        foreach (string sentence in sentences)
        {
            remSentence = sentence;
            richTextBox3.Text = remSentence;
            for (int i = 0; i < keywords.Length; i++)
            {
                if (remSentence.Contains(keywords[i])) // || (remSentence.IndexOf(keywords[i]) > 0))
                {
                    richTextBox2.Text += remSentence.Substring(0, remSentence.IndexOf(keywords[i])) + '\n' + keywords[i] + '\n';
                    remSentence = remSentence.Substring(remSentence.IndexOf(keywords[i]) + keywords[i].Length);
                }
            }
            richTextBox2.Text += remSentence;
        }
    }

    public static string[] SentenceTokenizer(string text)
    {
        char[] sentdelimiters = new char[] { '.', '?', '۔', '؟', '\r', ':', '-' }; //    '{ ',' }', '( ', ' )', ' [', ']', '>', '<','-', '_', '= ', '+','|', '\\', ':', ';', ' ', '\'', ',', '.', '/', '?', '~', '!','@', '#', '$', '%', '^', '&', '*', ' ', '\r', '\n', '\t'};
        // text.Remove('\n');
        return text.Split(sentdelimiters, StringSplitOptions.RemoveEmptyEntries);
    }

score 1 · Answer 1 · edited May 23 '17 at 12:28


Instead of doing things manually, you could take care of this with regular expressions. I'll use English in my example so that I don't accidentally butcher poor Urdu.

    using System.Text.RegularExpressions;

    // Verbatim string (@"...") so that \b is a regex word boundary, not a backspace character.
    Regex r = new Regex(@"\b(and|or|hence)");
    sentence = r.Replace(sentence, "|");     // A placeholder unlikely to occur in normal text.
    string[] phrases = sentence.Split('|');  // Each piece between conjunctions.

You may need to tweak it for capitalization(?) and the possibility that a conjunction might be part of another word (I used a leading space--or word boundary from @Drahcir's suggestion--to start that process). See this answer for working with .NET's version of back-references.
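As a rough sketch of those tweaks, the version below ignores case and keeps the conjunctions in the output instead of replacing them with a placeholder. It relies on the documented behaviour that `Regex.Split` includes captured text in its result when the pattern contains a capturing group. The class name `ConjunctionSplitter`, the method name, and the English conjunction list are assumptions for illustration only. It also puts `\b` on both sides, which differs from the one-sided boundary above, so "and/or" would come out as two separate conjunctions.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class ConjunctionSplitter
{
    // Hypothetical conjunction list; substitute the Urdu conjunctions you need.
    private static readonly Regex Conjunctions = new Regex(
        @"\b(and|or|hence|however|because|so that)\b",
        RegexOptions.IgnoreCase | RegexOptions.CultureInvariant);

    // Returns phrases and conjunctions in their original order.
    // Regex.Split scans the whole input, so repeated conjunctions are all found.
    public static List<string> Split(string sentence)
    {
        var parts = new List<string>();

        // The capturing group makes Regex.Split return the matched conjunctions too.
        foreach (string piece in Conjunctions.Split(sentence))
        {
            string trimmed = piece.Trim();
            if (trimmed.Length > 0)
            {
                parts.Add(trimmed);
            }
        }
        return parts;
    }
}
```

Calling `ConjunctionSplitter.Split(sentence)` for each sentence and joining the result with `"\n"` should give output similar to what the question's loop writes to `richTextBox2`, including the second occurrence of a conjunction.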


answered Mar 28 '14 at 17:30

John C


Could use a word boundary `\b` instead of a leading space – Drahcir Mar 28 '14 at 17:58
Good plan. It's a non-English target, after all. Not on both sides, though, since that could miss "and/or" and similar. – John C Mar 28 '14 at 18:00
@John C Thanks, it works. In some cases the word "and" appears between two nouns rather than between phrases, e.g. "Akram and Jhon are best friends." Here the "and" needs to be handled separately, in context. Is there any way to handle such cases? – Khan Mar 29 '14 at 10:31
No, at that point, you need to start parsing, unfortunately. What we're doing here is "lexical analysis," picking out special tokens in the text. Parsing takes those tokens and finds patterns of grammar. That becomes much more tricky, since natural languages (especially more mature languages and languages that are on geopolitical boundaries) aren't "regular." But if the input is simple enough, you can look at "parser generators" to see what's feasible. – John C Mar 29 '14 at 12:46
@JohnC I have a lexicon of 20 words that normally appear before "and" (اور) in Urdu. What I want is a way to check the word before "and" against the lexicon: if it is found, the sentence is split; otherwise the complete sentence is displayed. Splitting with a regex, I lose control over the word "and" for further processing. I am not intending to develop a complete parser or lexical analyzer. – Khan Mar 30 '14 at 07:19
I understand that, but the difference between "find these words" and "find these words, but only in context" is parsing. It might be possible to do something with back-references (link in the last sentence of my answer) to match a word in parentheses and "replace" it with itself (`${1}`), but since a regular expression can't make decisions, you may run into a problem with consecutive conjunctions again. – John C Mar 30 '14 at 10:46
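One possible way to keep control over each "and" is to let the regular expression only locate candidates and make the lexicon decision in ordinary C# code. The following is a minimal sketch under that assumption, not a substitute for real parsing: the names `LexiconSplitter` and `Split` are hypothetical, the pattern uses English "and" as a stand-in for "اور", and the lexicon is whatever word set the caller supplies.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class LexiconSplitter
{
    // Finds a whole word followed by the conjunction.
    // Group "prev" is the word before "and"; group "conj" is the conjunction itself.
    private static readonly Regex WordThenAnd = new Regex(
        @"\b(?<prev>\w+)\s+(?<conj>and)\b", RegexOptions.IgnoreCase);

    // Splits only at an "and" whose preceding word is in the lexicon;
    // any other "and" is left inside its phrase.
    public static List<string> Split(string sentence, ISet<string> lexicon)
    {
        var parts = new List<string>();
        int start = 0;

        foreach (Match m in WordThenAnd.Matches(sentence))
        {
            if (!lexicon.Contains(m.Groups["prev"].Value))
            {
                continue; // Preceding word not in the lexicon: do not split here.
            }

            Group conj = m.Groups["conj"];
            parts.Add(sentence.Substring(start, conj.Index - start).Trim());
            parts.Add(conj.Value);
            start = conj.Index + conj.Length;
        }

        parts.Add(sentence.Substring(start).Trim()); // Remainder after the last split.
        return parts;
    }
}
```

Passing a `HashSet<string>` created with `StringComparer.OrdinalIgnoreCase` makes the lexicon lookup case-insensitive. Because the check looks only at the single word before the conjunction, it cannot tell apart cases that need more context, which is the limitation described in the comment above.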
