
What is the best way to split unpunctuated text into sentences in Java?

The text may contain multiple sentences without punctuation, e.g.:

String text = "i ate cornflakes it is a sunny day i have to wash my car";
String[] sentences = splitOnSentenceLevel(text);
System.out.print(Arrays.toString(sentences));
// desired output: [i ate cornflakes, it is a sunny day, i have to wash my car]

The only solution I could find is to train an n-gram model on punctuated text so that it estimates, for each position, the probability that a sentence ends there. Setting that up seems like a huge task, though.

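import java.util.ArrayList;
import java.util.List;

public String[] splitOnSentenceLevel(String text) {
    List<String> sentences = new ArrayList<>();
    StringBuilder currentSentence = new StringBuilder();
    for (String word : text.trim().split("\\s+")) {
        // Separate words with a single space, without a leading one
        if (currentSentence.length() > 0) {
            currentSentence.append(' ');
        }
        currentSentence.append(word);
        // nGramClassifierIsLastWordOfSentence is the n-gram classifier that still has to be written
        if (nGramClassifierIsLastWordOfSentence(word)) {
            sentences.add(currentSentence.toString());
            currentSentence.setLength(0);
        }
    }
    // Keep trailing words even if the classifier never flagged a final boundary
    if (currentSentence.length() > 0) {
        sentences.add(currentSentence.toString());
    }
    return sentences.toArray(new String[0]);
}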
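
To make the n-gram idea more concrete, here is a minimal sketch of what such a boundary model might look like. It is only an assumption about one possible design, not an existing library: the class `BigramBoundaryModel` and all of its methods are hypothetical names. It counts, for each adjacent word pair in punctuated training text, how often a sentence boundary fell between the two words, and backs off to single-word evidence for pairs it has never seen. It also treats any token ending in `.`, `!` or `?` as a sentence end, which will misfire on abbreviations.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical bigram boundary model; all names are illustrative. */
final class BigramBoundaryModel {
    // Each value holds {times seen with a boundary, times seen in total}.
    private final Map<String, int[]> pairCounts = new HashMap<>();
    private final Map<String, int[]> endCounts = new HashMap<>();   // keyed by left word
    private final Map<String, int[]> startCounts = new HashMap<>(); // keyed by right word

    /** Learns from punctuated text; assumes sentences end at '.', '!' or '?'. */
    void train(String punctuatedText) {
        String previous = null;
        boolean previousEnded = false;
        for (String token : punctuatedText.toLowerCase().trim().split("\\s+")) {
            String word = token.replaceAll("[^\\p{L}\\p{Nd}']", "");
            boolean endsSentence = token.matches(".*[.!?]");
            if (word.isEmpty()) {
                // Stand-alone punctuation still marks a boundary after the previous word.
                previousEnded |= endsSentence;
                continue;
            }
            if (previous != null) {
                count(pairCounts, previous + " " + word, previousEnded);
                count(endCounts, previous, previousEnded);
                count(startCounts, word, previousEnded);
            }
            previous = word;
            previousEnded = endsSentence;
        }
    }

    /** Estimated probability that a sentence boundary lies between left and right. */
    double boundaryProbability(String left, String right) {
        int[] pair = pairCounts.get(left + " " + right);
        if (pair != null) {
            return (double) pair[0] / pair[1];
        }
        // Unseen pair: average the single-word estimates (0 if a word is unknown).
        return 0.5 * (ratio(endCounts.get(left)) + ratio(startCounts.get(right)));
    }

    /** Splits unpunctuated text wherever the estimate reaches the threshold. */
    List<String> split(String text, double threshold) {
        List<String> sentences = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        String previous = null;
        for (String word : text.toLowerCase().trim().split("\\s+")) {
            if (word.isEmpty()) {
                continue;
            }
            if (previous != null && boundaryProbability(previous, word) >= threshold) {
                sentences.add(current.toString());
                current.setLength(0);
            }
            if (current.length() > 0) {
                current.append(' ');
            }
            current.append(word);
            previous = word;
        }
        if (current.length() > 0) {
            sentences.add(current.toString());
        }
        return sentences;
    }

    private static double ratio(int[] counts) {
        return counts == null ? 0.0 : (double) counts[0] / counts[1];
    }

    private static void count(Map<String, int[]> map, String key, boolean boundary) {
        int[] counts = map.computeIfAbsent(key, k -> new int[2]);
        if (boundary) {
            counts[0]++;
        }
        counts[1]++;
    }
}
```

Usage would be along the lines of `model.train(corpus)` followed by `model.split(text, 0.5)`. The threshold of 0.5 is an arbitrary starting point, and a real model would need far more training text plus some smoothing to be useful.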

The Stanford CoreNLP toolkit doesn't seem to have such a feature either. The task is obviously ambiguous, but is there a simpler way of at least approximating a solution? The text I want to analyze contains relatively simple, short sentences.

bear
  • 2
    I know nothing about NLP. Does it have a way to tell whether something you've given it is a grammatically correct sentence? If so, you could start with all the words and keep deleting the last word until it says "yes". – ajb Mar 06 '17 at 00:38
  • 1
    This is probably going to be a huge task regardless unless you use a library to do the heavy lifting for you. This isn't a simple problem. – Carcigenicate Mar 06 '17 at 00:46
  • 2
    This is a duplicate of https://stackoverflow.com/questions/11371476/sentence-segmentation-tools-to-use-when-input-sentence-has-no-punctuation-is-no - however, neither does that question have a good answer. A better discussion is found on another SO site: https://linguistics.stackexchange.com/questions/3167/are-there-sentence-boundary-disambiguation-algorithms-which-can-handle-punctuati – fnl Mar 06 '17 at 14:17
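
ajb's suggestion in the first comment amounts to a greedy longest-prefix search, which can be sketched as follows. This is only an illustration under assumptions: `GreedyPrefixSplitter` is a hypothetical name, and the grammaticality check is passed in as a `Predicate` because no concrete checker is named in the thread. If no prefix is accepted, the sketch falls back to emitting a single word so that it always makes progress.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

/** Hypothetical greedy splitter; the sentence check is supplied by the caller. */
final class GreedyPrefixSplitter {
    static List<String> split(String text, Predicate<List<String>> isSentence) {
        List<String> words = new ArrayList<>();
        for (String word : text.trim().split("\\s+")) {
            if (!word.isEmpty()) {
                words.add(word);
            }
        }
        List<String> sentences = new ArrayList<>();
        int start = 0;
        while (start < words.size()) {
            int end = words.size();
            // Drop trailing words until the candidate is accepted or one word remains.
            while (end > start + 1 && !isSentence.test(words.subList(start, end))) {
                end--;
            }
            sentences.add(String.join(" ", words.subList(start, end)));
            start = end;
        }
        return sentences;
    }
}
```

Because it tries the longest candidate first, this approach accepts run-on text whenever the checker does, and it calls the checker a quadratic number of times in the worst case.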

0 Answers