2

I want to split a text into sentences (by splitting on . or with BreakIterator), but no resulting piece may be longer than 100 characters.

Example:

Lorem ipsum dolor sit. Amet consetetur sadipscing elitr,
sed diam nonumy eirmod tempor invidunt ut labore et dolore
magna aliquyam erat, sed diam voluptua. At vero eos et accusam
et justo duo dolores.

To (3 elements; a sentence may be broken, but never a word):

" Lorem ipsum dolor sit. ",
" Amet consetetur sadipscing elitr, sed diam nonumy eirmod tempor invidunt
  ut labore et dolore magna",
" aliquyam erat, sed diam voluptua. At vero eos et accusam
  et justo duo dolores. "

How can I do this properly?

DragonWork

4 Answers

3

There's probably a better way to do it, but here it goes:

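import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Wrapper class so the snippet compiles; the name is arbitrary.
public class SplitDemo {

    public static void main(String... args) {

        String originalString = "Lorem ipsum dolor sit. Amet consetetur sadipscing elitr, sed diam nonumy eirmod tempor invidunt ut labore "
                + "et dolore magna aliquyam erat, sed diam voluptua. At vero eos et accusam et justo duo dolores.";

        // Split into sentences; note that split() drops the periods themselves.
        String[] s1 = originalString.split("\\.");
        List<String> list = new ArrayList<String>();

        for (String s : s1)
            if (s.length() > 100)
                // Cut over-long sentences into 100-character chunks (this may cut mid-word).
                list.addAll(Arrays.asList(s.split("(?<=\\G.{100})")));
            else
                list.add(s);

        System.out.println(list);
    }
}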

The "split string in size" regex is from this SO question. You could probably combine the two regexes, but I'm not sure that would be a wise idea (:

If the regex doesn't run on Android (the \G operator is not recognized everywhere), try the other solutions linked there for splitting a string based on its size.
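For reference, a regex-free way to cut a string into fixed-size pieces might look like the sketch below. The class and method names (FixedSizeChunks, chunk) are made up for illustration and are not taken from the linked question. Like the regex, it still cuts in the middle of words.

```java
import java.util.ArrayList;
import java.util.List;

class FixedSizeChunks {

    // Hypothetical helper: cuts s into consecutive pieces of at most `size` characters.
    static List<String> chunk(String s, int size) {
        if (size <= 0) {
            throw new IllegalArgumentException("size must be positive");
        }
        List<String> parts = new ArrayList<String>();
        for (int start = 0; start < s.length(); start += size) {
            // The last piece may be shorter than `size`.
            parts.add(s.substring(start, Math.min(s.length(), start + size)));
        }
        return parts;
    }

    public static void main(String[] args) {
        System.out.println(chunk("abcdefghij", 4)); // prints [abcd, efgh, ij]
    }
}
```

In the loop above, the call s.split("(?<=\\G.{100})") could then be swapped for chunk(s, 100), which avoids the look-behind entirely.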

Marcelo
  • Thank you, I'll try it. ( **.** must be escaped ;D ) – DragonWork Feb 20 '12 at 15:21
  • This code doesn't compile and has a couple of bugs (most obviously that the first split regex should be "\\.", and that there is no add(String[]) method on a List), but if you fix those minor issues it does work and produces the approximate output he asked for. – cutchin Feb 20 '12 at 15:28
  • I only fetched the regex, and I got this error: 02-20 16:28:10.538: E/AndroidRuntime(2328): java.util.regex.PatternSyntaxException: Look-behind pattern matches must have a bounded maximum length near index 13: 02-20 16:28:10.538: E/AndroidRuntime(2328): (?<=\G.{100}) – DragonWork Feb 20 '12 at 15:29
  • This will cut strings in the middle of words :( – Macarse Feb 20 '12 at 15:31
2

Regex will not help you much in this kind of situation.

I would split the text on spaces or periods and then start concatenating the pieces until the limit is reached. Something like this:

Pseudo code

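// Java-style sketch (needs java.util.*); splitting on whitespace keeps periods attached to their words
List<String> words = new LinkedList<String>(Arrays.asList(text.split("\\s+")));
List<String> lines = new ArrayList<String>();
while (!words.isEmpty()) {

  String line = "";
  // Stop when the words run out or the next word no longer fits.
  // Caveat: a single word of 100 or more characters never fits, so the outer loop would not terminate.
  while (!words.isEmpty() && line.length() + words.get(0).length() < 100) {
    line += words.remove(0) + " ";
  }

  lines.add(line.trim());

}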
Macarse
2

Solved (thank you Macarse for the inspiration):

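// Lookahead split: cut before each whitespace character or period, so every piece keeps its delimiter.
String[] words = text.split("(?=[\\s\\.])");
ArrayList<String> array = new ArrayList<String>();
int i = 0;
while (words.length > i) {
    String line = "";
    // Append pieces while the line stays under 100 characters.
    // Caveat: a single piece of 100 or more characters never fits, so the outer loop would not terminate.
    while ( words.length > i && line.length() + words[i].length() < 100 ) {
        line += words[i];
        i++;
    }
    array.add(line);
}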
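A quick way to check this approach is to wrap it in a method and run it on the example from the question. The sketch below does that; the names LineWrapCheck and wrap, and the limit parameter, are made up for illustration. It assumes no single piece reaches the limit (otherwise the loop never finishes), and it relies on the lookahead split keeping every delimiter, so joining the lines should give back the original text.

```java
import java.util.ArrayList;
import java.util.List;

class LineWrapCheck {

    // Hypothetical wrapper around the loop above, with the limit as a parameter.
    // Assumes every piece is shorter than `limit`.
    static List<String> wrap(String text, int limit) {
        // Lookahead split: each piece keeps its leading space or period.
        String[] words = text.split("(?=[\\s\\.])");
        List<String> lines = new ArrayList<String>();
        int i = 0;
        while (i < words.length) {
            StringBuilder line = new StringBuilder();
            while (i < words.length && line.length() + words[i].length() < limit) {
                line.append(words[i]);
                i++;
            }
            lines.add(line.toString());
        }
        return lines;
    }

    public static void main(String[] args) {
        String text = "Lorem ipsum dolor sit. Amet consetetur sadipscing elitr, "
                + "sed diam nonumy eirmod tempor invidunt ut labore et dolore "
                + "magna aliquyam erat, sed diam voluptua. At vero eos et accusam "
                + "et justo duo dolores.";
        List<String> lines = wrap(text, 100);
        for (String line : lines) {
            // Show each line with its length; every length stays below the limit.
            System.out.println(line.length() + ": \"" + line + "\"");
        }
        // Nothing is dropped: the lines concatenate back to the input.
        System.out.println(String.join("", lines).equals(text));
    }
}
```

Note that this packs as many words as fit into each line, so it does not start a new element at every sentence boundary the way the expected output in the question does.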
DragonWork
0

Following the previous solutions, I quickly ran into an infinite loop when a single word exceeds the limit (very unlikely, but unfortunately I have a very constrained environment). So I added a fix (of sorts) for this edge case: words longer than the limit are first cut into limit-sized pieces.

import java.util.*;

public class Main
{
    public static void main(String[] args) {
        sentenceToLines("In which of the following, a person is constantly followed/chased by another person or group of several people?", 15);
    }

    private static ArrayList<String> sentenceToLines(String s, int limit) {
        String[] words = s.split("(?=[\\s\\.])");
        ArrayList<String> wordList =  new ArrayList<String>(Arrays.asList(words));
        ArrayList<String> array = new ArrayList<String>();
        int i = 0, temp;
        String word, line;
        while (i < wordList.size()) {
            line = "";
            temp = i;
            // split the long words to the size of the limit
            while(wordList.get(i).length() > limit) {
                word = wordList.get(i);
                wordList.add(i++, word.substring(0, limit));
                wordList.add(i, word.substring(limit));
                wordList.remove(i+1);
            }
            i = temp;
            // continue making lines with newly split words
            while ( i < wordList.size() && line.length() + wordList.get(i).length() <= limit ) {
                line += wordList.get(i);
                i++;
            }
            System.out.println(line.trim());
            array.add(line.trim());
        }
        return array;
    }
    
}
Srichakradhar