1
String x=" i am going to the party at 6.00 in the evening. are you coming with me?";

If I have the string above, I need to break it into sentences at sentence-boundary punctuation (such as . and ?).

But it should not split the sentence at 6.00, because the period there is a decimal point. Is there a way to identify the correct sentence boundaries in Java? I have tried using StringTokenizer from the java.util package, but it always breaks the sentence whenever it finds a period. Can someone suggest a method to do this correctly?

This is the method I have tried for tokenizing a text into sentences.

public static ArrayList<String> sentence_segmenter(String text) {
    ArrayList<String> Sentences = new ArrayList<String>();

    StringTokenizer st = new StringTokenizer(text, ".?!");
    while (st.hasMoreTokens()) {

        Sentences.add(st.nextToken());
    }
    return Sentences;
}

I also have a method to segment sentences into phrases, but here too the program splits the text whenever it finds a comma (,). I don't want it to split when the comma is inside a number such as 60,000. The following is the method I am using to segment the phrases.

public static ArrayList<String> phrasesSegmenter(String text) {
    ArrayList<String> phrases = new ArrayList<String>();
    StringTokenizer st = new StringTokenizer(text, ",");
    while (st.hasMoreTokens()) {
        phrases.add(st.nextToken());
    }
    return phrases;
}
Chirath
  • 2
    You need to use sentence splitters for this. See related question: http://stackoverflow.com/questions/9492707/how-can-i-split-a-text-into-sentences-using-the-stanford-parser – Kenston Choi Nov 03 '14 at 06:08

2 Answers

1

From the documentation of StringTokenizer:

StringTokenizer is a legacy class that is retained for compatibility reasons although its use is discouraged in new code. It is recommended that anyone seeking this functionality use the split method of String or the java.util.regex package instead.

If you use split, you can use any regular expression to split the text into sentences. You probably want something like: any of ?, ! or . followed by either whitespace or the end of the text. The period in 6.00 is followed by a digit, so it does not match:

text.split("[?!.]($|\\s)")
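The same idea can be sketched for both cases in the question. This is only an illustration, not part of the original answer: the class and method names are hypothetical, the lookbehind `(?<=[.?!])` is swapped in so the punctuation stays attached to each sentence, and the comma rule assumes that a comma with a digit on both sides belongs to a number.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper class, named here only for illustration.
final class RegexSegmenterSketch {

    // Split on whitespace that directly follows '.', '?' or '!'.
    // The lookbehind is not consumed, so the punctuation stays with its sentence.
    static List<String> splitSentences(String text) {
        List<String> sentences = new ArrayList<>();
        for (String part : text.trim().split("(?<=[.?!])\\s+")) {
            if (!part.isEmpty()) {
                sentences.add(part);
            }
        }
        return sentences;
    }

    // Split on a comma unless it has a digit on both sides (as in 60,000).
    static List<String> splitPhrases(String text) {
        List<String> phrases = new ArrayList<>();
        for (String part : text.split("(?<!\\d),|,(?!\\d)")) {
            String phrase = part.trim();
            if (!phrase.isEmpty()) {
                phrases.add(phrase);
            }
        }
        return phrases;
    }
}
```

Calling `RegexSegmenterSketch.splitSentences(x)` on the string from the question should give two sentences, with 6.00 left intact. Abbreviations such as "Dr. Smith" would still be split, because the pattern only looks at the characters around the punctuation.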
fejese
  • But how do I identify the correct place to split the sentence? If there is a decimal number in the middle of a sentence, the period in it cannot be taken as the end of the sentence. I need to know how to handle those situations @fejese – Chirath Nov 02 '14 at 22:12
  • 1
    Here's a regex fiddle: http://regex101.com/r/vB7gU9/1 Note that I removed the double escaping of the whitespace character matcher (`\s`) and added a `.*?` at the beginning to make it more visible what will be returned as the first element after the split – fejese Nov 02 '14 at 22:18
  • I still didn't get my answer. @fejese your regex is not working – Chirath Nov 03 '14 at 05:51
  • @fejese thanks buddy, it's working now. But I have another problem: I have edited my post. Can you check it and give me a solution? Thanks a lot. – Chirath Nov 05 '14 at 14:56
  • If you find that my answer fits your question, then accept it, and if you have additional, new questions, open a new one. However, your recent question is much the same as the original one. Spend some time understanding why and how the solution in this answer works, and it should be pretty straightforward to adapt it to your new question. – fejese Nov 05 '14 at 15:13
  • Your method works for both decimal numbers and words (abbreviations). But I only need to handle the decimal numbers. Can you edit your answer according to my requirement? – Chirath Nov 05 '14 at 20:58
  • @Chirath, there's no way to tell whether a word-dot combination is an abbreviation or a sentence ending. That is, unless we add more conditions, such as requiring the first letter of a sentence to be a capital letter (unlike in your example) etc. See? The "etc." in the last sentence is both an abbreviation and a sentence ending. Try figuring out all the rules you need or have and think about how you can combine them. (You can ask a new question once you have tried it and run into obstacles.) – fejese Nov 06 '14 at 11:01
1

Here is my solution to the problem.

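/**
 * Tries to decide whether there is a sentence end at index i of the given text.
 *
 * @param text the text to examine
 * @param i index of the character to test
 * @return true if the character at index i ends a sentence
 */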
public static boolean isSentenceEnd(String text, int i) {
    char c = text.charAt(i);
    return isSentenceEndChar(c) && !isPeriodWord(text, i);
} 
private static String[] periodWords = { "Mr.", "Mrs.", "Ms.", "Prof.", "Dr.", "Gen.", "Rep.", "Sen.", "St.",
                "Sr.", "Jr.", "Ph.", "Ph.D.", "M.D.", "B.A.", "M.A.", "D.D.", "D.D.S.",
                "B.C.", "b.c.", "a.m.", "A.M.", "p.m.", "P.M.", "A.D.", "a.d.", "B.C.E.", "C.E.",
                "i.e.", "etc.", "e.g.", "al."};

/**
 * Period words are words such as 'Dr.' or 'Mr.' whose period does not end a sentence.
 *
 * @param text the text to examine
 * @param i index of the period '.' character
 * @return true if the period at index i belongs to an abbreviation or a number
 */
private static boolean isPeriodWord(String text, int i) {
    if (i < 4) return true; // too close to the start of the text to end a sentence
    if (text.charAt(i - 2) == ' ') return true; // one-char words are definitely period words
    // Include the character at index i, so that the suffix checks below can see the period.
    String txt = text.substring(0, i + 1);
    for (String pword : periodWords) {
        if (txt.endsWith(pword)) return true;
    }
    if (txt.matches("^.*\\d\\.$")) return true; // dates separated with "." or numbers with a fraction
    return false;
}

private static final char[] sentenceEndChars = {'.', '?', '−'};
private static boolean isSentenceEndChar(char c) {
    for (char sec : sentenceEndChars) {
        if (c == sec) return true;
    }
    return false;
}
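The code above only answers whether a given index is a sentence end; it does not show how to cut the text with it. A minimal sketch of such a scanning loop follows. The class `BoundaryScanSketch` and its `isBoundary` method are hypothetical, and `isBoundary` is a simplified stand-in that only protects a period between two digits. It is not the `isSentenceEnd` method from this answer, and it knows nothing about abbreviations.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical class, named here only for illustration.
final class BoundaryScanSketch {

    // Simplified stand-in predicate: '.', '?' or '!' ends a sentence,
    // except a period that sits between two digits (as in 6.00).
    static boolean isBoundary(String text, int i) {
        char c = text.charAt(i);
        if (c != '.' && c != '?' && c != '!') {
            return false;
        }
        boolean digitBefore = i > 0 && Character.isDigit(text.charAt(i - 1));
        boolean digitAfter = i + 1 < text.length() && Character.isDigit(text.charAt(i + 1));
        return !(c == '.' && digitBefore && digitAfter);
    }

    // Walk the text once and cut after every index the predicate accepts.
    static List<String> split(String text) {
        List<String> sentences = new ArrayList<>();
        int start = 0;
        for (int i = 0; i < text.length(); i++) {
            if (isBoundary(text, i)) {
                addIfNotBlank(sentences, text.substring(start, i + 1));
                start = i + 1;
            }
        }
        // Keep any trailing text that has no terminating punctuation.
        addIfNotBlank(sentences, text.substring(start));
        return sentences;
    }

    private static void addIfNotBlank(List<String> out, String s) {
        String trimmed = s.trim();
        if (!trimmed.isEmpty()) {
            out.add(trimmed);
        }
    }
}
```

Replacing the call to `isBoundary` with a richer predicate, such as one that also consults a list of abbreviations, would leave the loop itself unchanged.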
Eli Mashiah