28

How can I split a text or paragraph into sentences using Stanford parser?

Is there a method that can extract sentences, like the getSentencesFromString() method provided for Ruby?

jogojapan
S Gaber
  • http://nlp.stanford.edu/software/ – Brian Roach Feb 29 '12 at 02:29
  • 1
    I already downloaded the parser package and ran a simple program with it. I would like some ideas about extracting the sentences from text using the parser. Is there a method I can use for that? – S Gaber Feb 29 '12 at 02:34

12 Answers

31

You can check the DocumentPreprocessor class. Below is a short snippet. I think there may be other ways to do what you want.

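import java.io.Reader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

import edu.stanford.nlp.ling.HasWord;
import edu.stanford.nlp.ling.SentenceUtils;
import edu.stanford.nlp.process.DocumentPreprocessor;

String paragraph = "My 1st sentence. “Does it work for questions?” My third sentence.";
Reader reader = new StringReader(paragraph);
DocumentPreprocessor dp = new DocumentPreprocessor(reader);
List<String> sentenceList = new ArrayList<String>();

// Each element yielded by the preprocessor is one sentence, as a list of tokens
for (List<HasWord> sentence : dp) {
   // SentenceUtils was named Sentence in older CoreNLP releases
   String sentenceString = SentenceUtils.listToString(sentence);
   sentenceList.add(sentenceString);
}

for (String sentence : sentenceList) {
   System.out.println(sentence);
}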
chunjy92
Kenston Choi
  • 1
    This code tokenizes the text into words. What I am looking for is splitting the paragraph into sentences. – S Gaber Feb 29 '12 at 04:17
  • You can consider concatenating the tokens to form the sentence, since you already have the 'sentence' variable present. Or maybe, you can try to substring(..) the text. Have you checked other methods present in the DocumentPreprocessor class? – Kenston Choi Feb 29 '12 at 06:02
  • 1
    I'm just a beginner at this. If you don't mind, could you provide a simple example? – S Gaber Feb 29 '12 at 06:48
  • I updated the code to reflect my first suggestion. This may not be perfect; I think Stanford's tokenizer replaces parenthesis symbols with other token symbols. Maybe there are methods that only perform sentence splitting rather than returning tokens. I haven't checked that yet. – Kenston Choi Feb 29 '12 at 07:39
  • 1
    You're welcome, but try a sentence with parentheses and quotes. I think the tokenization process replaces them with other symbols. – Kenston Choi Feb 29 '12 at 11:25
  • 1
    This puts whitespace between tokens, e.g., before every period, which is not great. – user1893354 Feb 06 '15 at 20:17
  • You can create simple if-conditions to handle such cases (e.g., the last punctuation in a sentence should not have a space before it) – Kenston Choi Feb 14 '15 at 05:14
  • 9
    I simplified the code by using an enhanced for loop and making use of a convenience method in the Sentence class which will convert a list of tokens back into a String. – Christopher Manning Mar 20 '15 at 20:18
  • This answer works faster than @kevin's answer for just sentence splitting task: 0.14 seconds vs 0.25 seconds. – Hamid Rouhani Nov 14 '16 at 12:40
  • @chunjy92 edited the use of Sentence to SentenceUtils. This can also be traced to the class rename: https://github.com/stanfordnlp/CoreNLP/commit/cb01027fd3d9c387575cf1ce488390620e6f6ac6 – Kenston Choi Jan 31 '17 at 07:24
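The if-condition idea raised in the comments above (no space before closing punctuation) could look roughly like the sketch below. It is only an illustration: it works on plain strings rather than CoreNLP HasWord tokens so that it runs without the library, and the class name TokenJoiner and its punctuation set are assumptions, not part of CoreNLP.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

final class TokenJoiner {
    // Tokens that attach to the preceding token (an assumed, incomplete set).
    private static final Set<String> NO_SPACE_BEFORE =
            new HashSet<>(Arrays.asList(".", ",", "?", "!", ";", ":"));

    // Joins tokens with single spaces, except before the punctuation listed above.
    static String join(List<String> tokens) {
        StringBuilder sb = new StringBuilder();
        for (String token : tokens) {
            if (sb.length() > 0 && !NO_SPACE_BEFORE.contains(token)) {
                sb.append(' ');
            }
            sb.append(token);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // prints "My 1st sentence."
        System.out.println(join(Arrays.asList("My", "1st", "sentence", ".")));
    }
}
```

This does not undo character substitutions made by the tokenizer (such as quotes), so it only addresses the spacing complaint.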
24

I know there is already an accepted answer... but typically you'd just grab the SentencesAnnotation from an annotated doc.

// creates a StanfordCoreNLP object, with POS tagging, lemmatization, NER, parsing, and coreference resolution 
Properties props = new Properties();
props.put("annotators", "tokenize, ssplit, pos, lemma, ner, parse, dcoref");
StanfordCoreNLP pipeline = new StanfordCoreNLP(props);

// read some text in the text variable
String text = ... // Add your text here!

// create an empty Annotation just with the given text
Annotation document = new Annotation(text);

// run all Annotators on this text
pipeline.annotate(document);

// these are all the sentences in this document
// a CoreMap is essentially a Map that uses class objects as keys and has values with custom types
List<CoreMap> sentences = document.get(SentencesAnnotation.class);

for(CoreMap sentence: sentences) {
  // traversing the words in the current sentence
  // a CoreLabel is a CoreMap with additional token-specific methods
  for (CoreLabel token: sentence.get(TokensAnnotation.class)) {
    // this is the text of the token
    String word = token.get(TextAnnotation.class);
    // this is the POS tag of the token
    String pos = token.get(PartOfSpeechAnnotation.class);
    // this is the NER label of the token
    String ne = token.get(NamedEntityTagAnnotation.class);       
  }

}

Source - http://nlp.stanford.edu/software/corenlp.shtml (half way down)

And if you're only looking for sentences, you can drop the later steps like "parse" and "dcoref" from the pipeline initialization; it'll save you some load and processing time. Rock and roll. ~K
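As a rough sketch of that trimmed-down setup, the properties for a sentence-only pipeline might be built as follows. The class and method names are made up for illustration, and the block only constructs the java.util.Properties object so that it runs without CoreNLP on the classpath.

```java
import java.util.Properties;

final class SentenceOnlyConfig {
    // Keeps only the tokenizer and the sentence splitter; tagging, parsing
    // and coreference are left out.
    static Properties sentenceOnlyProps() {
        Properties props = new Properties();
        props.setProperty("annotators", "tokenize, ssplit");
        return props;
    }

    public static void main(String[] args) {
        // Hand the result to new StanfordCoreNLP(props) as in the snippet above.
        System.out.println(sentenceOnlyProps().getProperty("annotators"));
    }
}
```

With these properties the loop over SentencesAnnotation still works, but the POS and NER lookups inside it would have nothing to return.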

Christopher Manning
Kevin
17

There are a couple of issues with the accepted answer. First, the tokenizer transforms some characters; for example, the single character “ becomes the two characters ``. Second, joining the tokenized text back together with whitespace does not reproduce the original string. As a result, the accepted answer's code alters the input text in non-trivial ways.

However, each CoreLabel that the tokenizer produces keeps track of the source character offsets it maps to, so it is trivial to rebuild the proper string if you have the original.

Approach 1 below shows the accepted answer's approach; Approach 2 shows my approach, which overcomes these issues.

String paragraph = "My 1st sentence. “Does it work for questions?” My third sentence.";

List<String> sentenceList;

/* ** APPROACH 1 (BAD!) ** */
Reader reader = new StringReader(paragraph);
DocumentPreprocessor dp = new DocumentPreprocessor(reader);
sentenceList = new ArrayList<String>();
for (List<HasWord> sentence : dp) {
    sentenceList.add(Sentence.listToString(sentence));
}
System.out.println(StringUtils.join(sentenceList, " _ "));

/* ** APPROACH 2 ** */
//// Tokenize
List<CoreLabel> tokens = new ArrayList<CoreLabel>();
PTBTokenizer<CoreLabel> tokenizer = new PTBTokenizer<CoreLabel>(new StringReader(paragraph), new CoreLabelTokenFactory(), "");
while (tokenizer.hasNext()) {
    tokens.add(tokenizer.next());
}
//// Split sentences from tokens
List<List<CoreLabel>> sentences = new WordToSentenceProcessor<CoreLabel>().process(tokens);
//// Join back together
int end;
int start = 0;
sentenceList = new ArrayList<String>();
for (List<CoreLabel> sentence: sentences) {
    end = sentence.get(sentence.size()-1).endPosition();
    sentenceList.add(paragraph.substring(start, end).trim());
    start = end;
}
System.out.println(StringUtils.join(sentenceList, " _ "));

This outputs:

My 1st sentence . _ `` Does it work for questions ? '' _ My third sentence .
My 1st sentence. _ “Does it work for questions?” _ My third sentence.
dantiston
  • Thanks, this is precisely what I was looking for. Wish there was a quick way to get this from Stanford CoreNLP's server or simple class call from command line. I don't want to create entire java project just for one use case. Any idea how to modify this to take a text file as input (and not load it as a string?) – wadkar Apr 20 '16 at 23:34
  • @Sudhi I think it'd be great if CoreNLP decided to add this as an easy option, maybe even the default option. The easiest way to modify it to take a file would be to pass your reader directly into the PTBTokenizer object (instead of the StringReader), but that's going to read your file into memory. You'd need to engineer something to process one sentence at a time in memory. – dantiston Apr 21 '16 at 19:45
  • actually, it appears that Sentence class has a method `Sentence.listToOriginalTextString` which takes the `List sentence` variable in your code. It also mentions that the PTBT tokenizer needs to be run with `"invertible=true"` option. – wadkar Apr 23 '16 at 06:29
  • @Sudhi there is a listToOriginalTextString method now, but it operates similarly to listToString method Dan used in the accepted answer. If you have the original text (which you may not), my method doesn't require the -invertible=true flag and is much more efficient (O(2) vs O(n)). – dantiston Apr 23 '16 at 15:51
  • @dantiston Hey thanks for the approach. I am using your approach in a for loop as i have multiple paragraphs which needs to be segmented. The paragraphs are present in a list array. When i individually run a particular line it segments perfectly but when i pass it using the loop the result is not the same. Any idea why this could be happening ? – navinkb Jun 13 '16 at 15:22
  • @navinkb not without seeing your code. Try asking a new question and feel free to mention me in a comment. – dantiston Jun 13 '16 at 19:21
  • @dantiston Thanks for providing these 2 approaches; really helpful in seeing the ways a text could be sentence-split. However, I would argue that approach #1 is actually ideal depending on what you're doing downstream. This is really important sentence-level normalization that can be really useful when one wants to treat messy data the same as pristine newspaper writing. However, should you ever need to reference sentences produced by #1 in the original text, prepare yourself for pain, while with #2 this is really straightforward :) – dmn Nov 22 '17 at 17:26
  • 1
    @dmn I agree. I don't think CoreNLP's decision to do sentence splitting after tokenization is unrealistic in most real-world scenarios, as you often want the tokens, and the sentence boundaries are either meaningless or provide a different layer of information. However, it is annoying to handle when you want just the sentence boundaries. – dantiston Nov 23 '17 at 00:00
9

Using the .NET C# package: this will split sentences, get the parentheses correct, and preserve original spaces and punctuation:

public class NlpDemo
{
    public static readonly TokenizerFactory TokenizerFactory = PTBTokenizer.factory(new CoreLabelTokenFactory(),
                "normalizeParentheses=false,normalizeOtherBrackets=false,invertible=true");

    public void ParseFile(string fileName)
    {
        using (var stream = File.OpenRead(fileName))
        {
            SplitSentences(stream);
        }
    }

    public void SplitSentences(Stream stream)
    {            
        var preProcessor = new DocumentPreprocessor(new UTF8Reader(new InputStreamWrapper(stream)));
        preProcessor.setTokenizerFactory(TokenizerFactory);

        foreach (java.util.List sentence in preProcessor)
        {
            ProcessSentence(sentence);
        }            
    }

    // print the sentence with original spaces and punctuation.
    public void ProcessSentence(java.util.List sentence)
    {
        System.Console.WriteLine(edu.stanford.nlp.util.StringUtils.joinWithOriginalWhiteSpace(sentence));
    }
}

Input: - This sentence's characters possess a certain charm, one often found in punctuation and prose. This is a second sentence? It is indeed.

Output: 3 sentences ('?' is considered an end-of-sentence delimiter)

Note: for a sentence like "Mrs. Havisham's class was impeccable (as far as one could see!) in all aspects.", the tokenizer will correctly discern that the period at the end of Mrs. is not an EOS; however, it will incorrectly mark the ! within the parentheses as an EOS and split off "in all aspects." as a second sentence.

Yaniv.H
2

You can use the Simple API provided by Stanford CoreNLP version 3.6.0 or 3.7.0.

Here's an example with 3.6.0. It works exactly the same with 3.7.0.

Java Code Snippet

import java.util.List;

import edu.stanford.nlp.simple.Document;
import edu.stanford.nlp.simple.Sentence;
public class TestSplitSentences {
    public static void main(String[] args) {
        Document doc = new Document("The text paragraph. Another sentence. Yet another sentence.");
        List<Sentence> sentences = doc.sentences();
        sentences.stream().forEach(System.out::println);
    }
}

Yields:

The text paragraph.

Another sentence.

Yet another sentence.

pom.xml

<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <modelVersion>4.0.0</modelVersion>

    <groupId>stanfordcorenlp</groupId>
    <artifactId>stanfordcorenlp</artifactId>
    <version>1.0-SNAPSHOT</version>

    <properties>
        <maven.compiler.source>1.8</maven.compiler.source>
        <maven.compiler.target>1.8</maven.compiler.target>
    </properties>

    <dependencies>
        <!-- https://mvnrepository.com/artifact/edu.stanford.nlp/stanford-corenlp -->
        <dependency>
            <groupId>edu.stanford.nlp</groupId>
            <artifactId>stanford-corenlp</artifactId>
            <version>3.6.0</version>
        </dependency>
        <!-- https://mvnrepository.com/artifact/com.google.protobuf/protobuf-java -->
        <dependency>
            <groupId>com.google.protobuf</groupId>
            <artifactId>protobuf-java</artifactId>
            <version>2.6.1</version>
        </dependency>
    </dependencies>
</project>
Community
cindyxiaoxiaoli
1

You can use the document preprocessor. It's really easy. Just feed it a filename.

    for (List<HasWord> sentence : new DocumentPreprocessor("pathto/filename.txt")) {
         //sentence is a list of words in a sentence
    }
bernie2436
1

You can easily use the Stanford tagger for this.

String text = "Your text....";  // Your own text.
List<List<HasWord>> tokenizedSentences = MaxentTagger.tokenizeText(new StringReader(text));

for (List<HasWord> act : tokenizedSentences)       // Iterate over the sentences
{
    // Sentence was renamed SentenceUtils in later CoreNLP releases
    System.out.println(edu.stanford.nlp.ling.Sentence.listToString(act)); // This is your sentence
}
Delirante
0

A variation on @Kevin's answer that addresses the question is as follows:

for (CoreMap sentence : sentences) {
      String sentenceText = sentence.get(TextAnnotation.class);
}

which gets you the sentence information without bothering with the other annotators.

demongolem
0

Another element, not addressed except in a few downvoted answers, is how to set the sentence delimiters. The most common way, and the default, is to depend on the common punctuation marks that mark the end of a sentence. Gathered corpora may come in other document formats, one of which puts each sentence on its own line.

To set your delimiters for the DocumentPreprocessor as in the accepted answer, you would use setSentenceDelimiter(String). To use the pipeline approach suggested in the answer by @Kevin, you would work with the ssplit properties. For example, to use the end-of-line scheme described in the previous paragraph, you would set the property ssplit.eolonly to true.
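For the pipeline route, the property could be set along these lines. This is a minimal sketch: the class and method names are hypothetical, and the block only builds the java.util.Properties object, so it runs without CoreNLP installed.

```java
import java.util.Properties;

final class EolOnlyConfig {
    // Properties for a pipeline that treats every input line as one sentence.
    static Properties eolOnlyProps() {
        Properties props = new Properties();
        props.setProperty("annotators", "tokenize, ssplit");
        // Split sentences only at line ends instead of at punctuation.
        props.setProperty("ssplit.eolonly", "true");
        return props;
    }

    public static void main(String[] args) {
        // The result would be passed to new StanfordCoreNLP(props).
        eolOnlyProps().forEach((key, value) -> System.out.println(key + " = " + value));
    }
}
```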

demongolem
0

Add the paths for your input and output files in the code below:

import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.FileReader;
import java.io.FileWriter;
import java.io.IOException;
import java.io.PrintWriter;
import java.util.Properties;

import edu.stanford.nlp.pipeline.Annotation;
import edu.stanford.nlp.pipeline.StanfordCoreNLP;

public class NLPExample
{
    public static void main(String[] args) throws IOException
    {
        // Adjust these two paths to your own input and output files.
        String inputPath = "C:\\Users\\ACER\\Downloads\\stanford-corenlp-full-2018-02-27\\input.txt";
        String outputPath = "C:\\Users\\ACER\\Downloads\\stanford-corenlp-full-2018-02-27\\output.txt";

        Properties props = new Properties();
        props.setProperty("annotators", "tokenize, ssplit, pos, lemma");
        StanfordCoreNLP pipeline = new StanfordCoreNLP(props);

        // One reader and one writer, both closed automatically.
        try (BufferedReader br = new BufferedReader(new FileReader(inputPath));
             PrintWriter pw = new PrintWriter(new BufferedWriter(new FileWriter(outputPath, false))))
        {
            String line;
            while ((line = br.readLine()) != null)
            {
                pw.println(line);                        // echo the input line
                Annotation annotation = new Annotation(line);
                pipeline.annotate(annotation);
                pipeline.prettyPrint(annotation, pw);    // human-readable dump of the annotations
            }
        }
        System.out.println("Done...");
    }
}
Stephen Rauch
Rahul Shah
-4
public class k {

    public static void main(String a[]) {

        // Split on single spaces
        String str = "This program splits a string based on space";
        String[] words = str.split(" ");
        for (String s : words) {
            System.out.println(s);
        }
        // Split on runs of whitespace
        str = "This     program  splits a string based on space";
        words = str.split("\\s+");
        for (String s : words) {
            System.out.println(s);
        }
    }
}
Robert
sayali
-5

Use a regular expression to split the text into sentences. I use Regex in C#, but I don't know the Java equivalent.

code

string[] sentences = Regex.Split(text, @"(?<=['""a-zA-Z][\)][\.\!\?])\s+(?=[A-Z])");

90% works
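For Java, a rough counterpart could use java.util.regex. The sketch below is not a port of the expression above: it uses a simplified pattern (an assumption) that splits on whitespace preceded by '.', '!' or '?' and followed by an uppercase letter, and the class name is made up for illustration.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

final class RegexSentenceSplitter {
    // Simplified boundary: whitespace after '.', '!' or '?' and before an uppercase letter.
    private static final Pattern BOUNDARY = Pattern.compile("(?<=[.!?])\\s+(?=[A-Z])");

    static List<String> split(String text) {
        return Arrays.asList(BOUNDARY.split(text.trim()));
    }

    public static void main(String[] args) {
        // Three sentences, one per line
        split("My first sentence. Does it work? Yes it does.").forEach(System.out::println);
        // Abbreviations break it: this is cut after "Mrs."
        split("Mrs. Havisham left.").forEach(System.out::println);
    }
}
```

The second call shows the kind of case a plain regex gets wrong and the Stanford tokenizer handles, which is why the library-based answers are usually the better choice.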

danial