For one of my projects I need to split paragraphs into sentences. I have already found that you can use the following code to break the paragraph(s) into different sentences then print them:

BreakIterator iterator = BreakIterator.getSentenceInstance(Locale.US);
iterator.setText(content);
int start = iterator.first();
for (int end = iterator.next();
    end != BreakIterator.DONE;
    start = end, end = iterator.next()) {
    System.out.println(content.substring(start, end));
}

Here, 'content' is a String variable that has already been defined.

However, I would like to store the resulting sentences as strings so that I can keep using them.

How would I do this? I think it may have something to do with a string array. Thanks for your help.

Sotirios Delimanolis
  • Why don't you use .split(), pass the appropriate delimiter(s), and receive all your sentences in a String[]? – TJ- Aug 02 '14 at 17:37
  • @TJ- I didn't use the .split() because I feel that it would not split the paragraph(s) correctly. For example, if I split by periods, then dates such as Aug. 8, 2014 would be split even though it is not a sentence. Or, if I split by a period, then a capital letter, then Mr. Johnson would be split. – systemcode Aug 02 '14 at 17:46
  • Actually, now that I think about it, BreakIterator has the same problem with names. Do you think there is any way to fix that? – systemcode Aug 02 '14 at 17:49
  • Yes, String.split(String regex) supports splitting by regex. Come up with a good regex that caters to your needs. From what I see, there will be a lot of cases to handle. – TJ- Aug 04 '14 at 13:51
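
To make the trade-off discussed in these comments concrete, here is a minimal sketch, not a definitive solution. It first splits on whitespace that follows '.', '!' or '?' with String.split, which breaks "Mr. Johnson" and "Aug. 8, 2014" apart, and then glues back any fragment that ends in a known abbreviation. The class name AbbreviationAwareSplitter, the method names, and the tiny abbreviation list are all assumptions made for illustration; a real list would need to be much longer.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class AbbreviationAwareSplitter {

    // Hypothetical, deliberately short list; extend it for real text.
    private static final Set<String> ABBREVIATIONS =
            new HashSet<>(Arrays.asList("Mr.", "Mrs.", "Dr.", "Aug."));

    // Naive split: break at whitespace that is preceded by '.', '!' or '?'.
    public static List<String> naiveSplit(String text) {
        return Arrays.asList(text.trim().split("(?<=[.!?])\\s+"));
    }

    // Re-join a fragment with the following one when it ends in a known abbreviation.
    public static List<String> mergeAbbreviations(List<String> fragments) {
        List<String> sentences = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (String fragment : fragments) {
            if (current.length() > 0) {
                current.append(' ');
            }
            current.append(fragment);
            if (!endsWithAbbreviation(fragment)) {
                sentences.add(current.toString());
                current.setLength(0);
            }
        }
        if (current.length() > 0) {
            // The text ended on an abbreviation; keep what was collected.
            sentences.add(current.toString());
        }
        return sentences;
    }

    private static boolean endsWithAbbreviation(String fragment) {
        String lastWord = fragment.substring(fragment.lastIndexOf(' ') + 1);
        return ABBREVIATIONS.contains(lastWord);
    }

    public static void main(String[] args) {
        String content = "Mr. Johnson arrived on Aug. 8, 2014. He sat down.";
        List<String> naive = naiveSplit(content);
        System.out.println(naive);                     // four fragments
        System.out.println(mergeAbbreviations(naive)); // two sentences
    }
}
```

The same merge step could be applied to fragments produced by BreakIterator instead of String.split. Its known weakness is that a sentence which genuinely ends in a listed abbreviation will be merged with the next one.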

2 Answers


I've never used BreakIterator; I assume you want it for locale purposes (FYI: here and here). Either way, you can keep the sentences in an array or a List, as you've mentioned.

// Requires: java.text.BreakIterator, java.util.ArrayList, java.util.List, java.util.Locale
BreakIterator iterator = BreakIterator.getSentenceInstance(Locale.US);
iterator.setText(content);
int start = iterator.first();

// Collect each sentence in a list instead of printing it.
List<String> sentences = new ArrayList<String>();
for (int end = iterator.next(); end != BreakIterator.DONE; start = end, end = iterator.next()) {
    sentences.add(content.substring(start, end));
}
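
Once the sentences are in a List, each element is an ordinary String that can be read by index or copied into a String[]. The following is a small sketch under assumptions: the class name SentenceListDemo, the method name splitSentences, and the sample text are made up for illustration, and each fragment is trimmed because BreakIterator keeps the trailing whitespace with the sentence that precedes it.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

public class SentenceListDemo {

    // Same loop as above, wrapped in a method that returns the list.
    public static List<String> splitSentences(String content) {
        BreakIterator iterator = BreakIterator.getSentenceInstance(Locale.US);
        iterator.setText(content);
        List<String> sentences = new ArrayList<>();
        int start = iterator.first();
        for (int end = iterator.next(); end != BreakIterator.DONE; start = end, end = iterator.next()) {
            // trim() drops the whitespace that separates this sentence from the next
            sentences.add(content.substring(start, end).trim());
        }
        return sentences;
    }

    public static void main(String[] args) {
        List<String> sentences =
                splitSentences("This is the first sentence. Here is a second one. And a third.");

        // Access a single sentence by its zero-based index.
        String first = sentences.get(0);
        System.out.println("first = " + first);

        // Or copy everything into a plain array.
        String[] asArray = sentences.toArray(new String[0]);
        for (int i = 0; i < asArray.length; i++) {
            System.out.println("sentence" + (i + 1) + " = " + asArray[i]);
        }
    }
}
```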
lebolo

Try this, which I got from this link:

import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.Locale;

// The class name is arbitrary; it only lets the snippet compile on its own.
public class SentenceSplitter {

    public static void main(String[] args) {
        String content =
                "Line boundary analysis determines where a text " +
                "string can be broken when line-wrapping. The " +
                "mechanism correctly handles punctuation and " +
                "hyphenated words. Actual line breaking needs to " +
                "also consider the available line width and is " +
                "handled by higher-level software. ";

        BreakIterator iterator =
                BreakIterator.getSentenceInstance(Locale.US);

        // Each element of the returned list is one sentence.
        ArrayList<String> sentences = count(iterator, content);
    }

    // Splits source at the sentence boundaries reported by bi and returns the pieces.
    private static ArrayList<String> count(BreakIterator bi, String source) {
        bi.setText(source);

        int lastIndex = bi.first();
        ArrayList<String> contents = new ArrayList<>();
        while (lastIndex != BreakIterator.DONE) {
            int firstIndex = lastIndex;
            lastIndex = bi.next();

            if (lastIndex != BreakIterator.DONE) {
                String sentence = source.substring(firstIndex, lastIndex);
                System.out.println("sentence = " + sentence);
                contents.add(sentence);
            }
        }
        return contents;
    }
}
prashant thakre
  • The only problem that I have with that code is it is basically the same as my code since there is no way to differentiate the sentence strings. Is there a way to get it to set the strings as sentence1, sentence2, sentence3, etc.? – systemcode Aug 02 '14 at 17:58
  • In my code, the count method stores all the sentences in an ArrayList of String and returns it. So you end up with one ArrayList that contains sentence1, sentence2, sentence3, etc. Let me know if it is still not clear. – prashant thakre Aug 02 '14 at 18:02
  • Oh sorry, I did not notice the ArrayList :/. Sorry about that. I'll try this code to see if it works for me. – systemcode Aug 02 '14 at 18:07
  • I have a slight problem with the ArrayList. When I use "System.out.println(contents.get(1))" to print the sentence stored at index 1, I get a java.lang.IndexOutOfBoundsException (the size of the ArrayList is only 1). When I access index 0, it prints out the first sentence 5 times. Do you know what is wrong with the code? – systemcode Aug 02 '14 at 19:44
  • It's due to content being a simple string. To check it properly, create a text file, insert some sentences that form a paragraph, read the file, and assign the result to content. – prashant thakre Aug 02 '14 at 20:01
  • I attempted to read it from a text file, but I still get the same error. Is it because I converted the text file to a string using " String content = new String(readAllBytes(get("text.txt")));"? – systemcode Aug 04 '14 at 16:23
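
For the file-based case raised in the last comments, one possible sketch is shown below. It does not diagnose the reported IndexOutOfBoundsException; it only shows one way to load the text and check how many sentences were found before indexing into the list. The class name FileSentenceReader, the method names, and the choice of UTF-8 are assumptions; text.txt is the file name from the comment above.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

public class FileSentenceReader {

    // Reads the whole file and splits it into sentences.
    public static List<String> readSentences(Path file) throws IOException {
        // Assumption: the file is UTF-8; decoding explicitly avoids platform defaults.
        String content = new String(Files.readAllBytes(file), StandardCharsets.UTF_8);
        return split(content);
    }

    public static List<String> split(String content) {
        BreakIterator iterator = BreakIterator.getSentenceInstance(Locale.US);
        iterator.setText(content);
        List<String> sentences = new ArrayList<>();
        int start = iterator.first();
        for (int end = iterator.next(); end != BreakIterator.DONE; start = end, end = iterator.next()) {
            String sentence = content.substring(start, end).trim();
            if (!sentence.isEmpty()) {
                sentences.add(sentence);
            }
        }
        return sentences;
    }

    public static void main(String[] args) throws IOException {
        Path file = Paths.get(args.length > 0 ? args[0] : "text.txt");
        if (!Files.exists(file)) {
            System.out.println("file not found: " + file);
            return;
        }
        List<String> sentences = readSentences(file);
        System.out.println("found " + sentences.size() + " sentence(s)");

        // Check the size before indexing to avoid IndexOutOfBoundsException.
        int index = 1;
        if (index < sentences.size()) {
            System.out.println(sentences.get(index));
        } else {
            System.out.println("no sentence at index " + index);
        }
    }
}
```

Printing the list size first makes it easier to tell whether the text was split at all or whether the whole file ended up as a single element.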