How to split a string into equal parts of maximum character length while maintaining word boundaries?

For example, if I split the string "hello world" into substrings of at most 7 characters, it should return

"hello "

and

"world"

But my current implementation returns

"hello w"

and

"orld   "

I am using the following code, taken from "Split string to equal length substrings in Java", to split the input string into equal parts:

public static List<String> splitEqually(String text, int size) {
    // Give the list the right capacity to start with. You could use an array
    // instead if you wanted.
    List<String> ret = new ArrayList<String>((text.length() + size - 1) / size);

    for (int start = 0; start < text.length(); start += size) {
        ret.add(text.substring(start, Math.min(text.length(), start + size)));
    }
    return ret;
}

Is it possible to maintain word boundaries while splitting the string into substrings?

To be more specific, I need the splitting algorithm to respect the word boundaries given by spaces rather than rely solely on character length. The length still matters, but as a maximum number of characters per substring rather than a fixed chunk size.

Nav
  • can you add one more example of input/output with more words? – jeojavi Sep 15 '14 at 17:24
  • Sure, e.g. "need for speed hot pursuit" with a max character range of, say, 16. I need the string to be split on word boundaries, so the output should be "need for speed " and "hot pursuit", but with my current implementation I get "need for speed h" and "ot pursuit ". – Nav Sep 15 '14 at 17:30
  • So the rule is to split at the whitespace that is at or before the max character range? What if the first word is longer than the character range? Do you split in the middle? Example: "reallylongwordisfirst and here are several regular words" with a length of 7, do you expect: "reallylongwordisfirst" "and " "here " "are " "several" "regular" "words"? – mdewitt Sep 15 '14 at 17:40
  • I have a max length of 4000 characters. I wonder if there is a word with 4000 characters, but anyway, this is meant for the Android text-to-speech engine, which messes up the pronunciation of words if word boundaries are not taken into account; on the other hand, it also has a maximum number of characters it can accept at one time. So I hope you can now see my dilemma. – Nav Sep 15 '14 at 17:42
  • Do you allow splits on words longer than your limit? For example, if you set max characters to `7`, how should `"hohohohoho merry Christmas"` be split? – Pshemo Sep 15 '14 at 17:43
  • See my comment above. The max character length is 4000 characters and is meant for normal human-pronounceable words, not scientific words like those that are 189,819 characters long: http://en.wikipedia.org/wiki/Longest_word_in_English – Nav Sep 15 '14 at 17:46
  • In other words, we can assume that the max length of the required substring will always be greater than the length of the longest word. – Pshemo Sep 15 '14 at 17:48
  • Yes, I agree on that. – Nav Sep 15 '14 at 17:48

2 Answers

If I understand your problem correctly, this code should do what you need (it assumes that maxLenght is greater than or equal to the length of the longest word):

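import java.util.regex.Matcher;
import java.util.regex.Pattern;

String data = "Hello there, my name is not importnant right now."
        + " I am just simple sentecne used to test few things.";
int maxLenght = 10;
// \G anchors each match at the end of the previous one; group 1 holds up to
// maxLenght characters that must be followed by whitespace or the end of input.
Pattern p = Pattern.compile("\\G\\s*(.{1,"+maxLenght+"})(?=\\s|$)", Pattern.DOTALL);
Matcher m = p.matcher(data);
while (m.find())
    System.out.println(m.group(1));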

Output:

Hello
there, my
name is
not
importnant
right now.
I am just
simple
sentecne
used to
test few
things.

Short (or not) explanation of "\\G\\s*(.{1,"+maxLenght+"})(?=\\s|$)" regex:

(Remember that in Java \ is special not only in regex but also in String literals, so to use a predefined character class like \d we have to write it as "\\d", escaping the \ in the string literal as well.)

  • \G - an anchor representing the end of the previous match or, if there is no match yet (when we have just started searching), the beginning of the string (same as ^)
  • \s* - zero or more whitespace characters (\s represents whitespace, * is the "zero-or-more" quantifier)
  • (.{1,"+maxLenght+"}) - let's split it into parts (at runtime maxLenght will hold some numeric value like 10, so the regex engine will see it as .{1,10})
    • . represents any character (by default it matches any character except line separators like \n or \r, but thanks to the Pattern.DOTALL flag it can match those too; you can drop this flag if you want each line of the input to be split separately, since the start of a new line will be printed on a new line anyway)
    • {1,10} - a quantifier which lets the previously described element appear 1 to 10 times (by default it tries to match the maximal number of repetitions),
    • .{1,10} - so, based on the above, it simply represents "1 to 10 of any characters"
    • ( ) - parentheses create groups, structures which let us capture specific parts of the match (here we put the parentheses after \\s* because we only want to use the part after the whitespace)
  • (?=\\s|$) - a look-ahead which makes sure that the text matched by .{1,10} is followed by:

    • whitespace (\\s)

      OR (written as |)

    • the end of the string ($).

So thanks to .{1,10} we can match up to 10 characters. But with (?=\\s|$) after it we require that the last character matched by .{1,10} is not part of an unfinished word (there must be whitespace or the end of the string after it).
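For use in a program, the same pattern can be wrapped in a method that returns the chunks as a list instead of printing them. This is only a minimal sketch: the class name `WordBoundarySplitter` and the method name `split` are made up for illustration, and trimming the input first is my own assumption (without it, trailing whitespace would come back as a whitespace-only chunk). Like the code above, it assumes no single word is longer than the limit.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, not part of any library.
class WordBoundarySplitter {

    // Splits text into chunks of at most maxLength characters, breaking only
    // where a chunk is followed by whitespace or the end of the input.
    // Assumes every word is at most maxLength characters long.
    static List<String> split(String text, int maxLength) {
        Pattern p = Pattern.compile(
                "\\G\\s*(.{1," + maxLength + "})(?=\\s|$)", Pattern.DOTALL);
        // trim() is an added assumption: it drops leading/trailing whitespace
        // so that no whitespace-only chunk is produced at the end.
        Matcher m = p.matcher(text.trim());
        List<String> chunks = new ArrayList<>();
        while (m.find()) {
            chunks.add(m.group(1));
        }
        return chunks;
    }
}
```

With the examples from the question, `WordBoundarySplitter.split("hello world", 7)` gives `[hello, world]` and `WordBoundarySplitter.split("need for speed hot pursuit", 16)` gives `[need for speed, hot pursuit]`. Note that the whitespace at which a chunk is cut is dropped rather than kept at the end of the chunk.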

Pshemo
  • Will include explanation of this regex soon, for now test if it is what you wanted, and let me know if it works. – Pshemo Sep 15 '14 at 18:02
  • Thank you, it works perfectly. Just one more question: I have been able to put the individual groups of strings into a list, but is there a way I can specify the ArrayList size? I heard it is faster when ArrayLists are pre-initialized with the length value. – Nav Sep 15 '14 at 18:32
  • If you want better performance, you want to avoid resizing the ArrayList (allocating a larger backing array and copying the current elements into it, which can be expensive for big lists). To avoid it, you can create the list with an initial capacity greater than the expected number of elements, so maybe something like `new ArrayList((int) (1.5 * text.length()) / size)`. – Pshemo Sep 15 '14 at 18:48
  • I multiplied the result of `text.length()/size` because the list needs extra room: tokens are often shorter than the limit, since a word that doesn't fit has to start a new token. For example, in `not importnant` the word `importnant` has to go into a separate token because it is too long to fit after `not`. – Pshemo Sep 15 '14 at 18:51
  • This is pretty cool - I always forget about `\\G`. `s/Lenght/Length/gi` though. – Boris the Spider Feb 12 '17 at 00:28
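
The presizing idea from the comments above could look roughly like this. It is a sketch only: `PresizedSplitter` and `estimateChunks` are hypothetical names, the factor 1.5 is the guess suggested in the comment (not a measured value), and the `+ 1` is my own addition so the estimate is never zero. The initial capacity is just a hint; the list still grows if the estimate turns out too small.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, not part of any library.
class PresizedSplitter {

    // Rough guess at the number of chunks: 1.5x the minimum possible count,
    // because chunks are often shorter than maxLength. The +1 is an added
    // assumption to avoid a capacity of zero for short inputs.
    static int estimateChunks(int textLength, int maxLength) {
        return (int) (1.5 * textLength) / maxLength + 1;
    }

    static List<String> split(String text, int maxLength) {
        // The capacity only avoids intermediate re-allocations; it does not
        // limit how many elements the list can hold.
        List<String> chunks =
                new ArrayList<>(estimateChunks(text.length(), maxLength));
        Pattern p = Pattern.compile(
                "\\G\\s*(.{1," + maxLength + "})(?=\\s|$)", Pattern.DOTALL);
        Matcher m = p.matcher(text);
        while (m.find()) {
            chunks.add(m.group(1));
        }
        return chunks;
    }
}
```

For the 100-character test sentence above with a limit of 10, the estimate is 16 while the actual result has 12 chunks, so no resize happens. Whether this is measurably faster for a few thousand characters is something to profile rather than assume.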

Non-regex solution, just in case someone is more comfortable (?) not using regular expressions:

private String justify(String s, int limit) {
    StringBuilder justifiedText = new StringBuilder();
    StringBuilder justifiedLine = new StringBuilder();
    String[] words = s.split(" ");
    for (int i = 0; i < words.length; i++) {
        justifiedLine.append(words[i]).append(" ");
        // Flush the line after the last word, or when the next word would not fit.
        if (i+1 == words.length || justifiedLine.length() + words[i+1].length() > limit) {
            // Drop the trailing space before emitting the line.
            justifiedLine.deleteCharAt(justifiedLine.length() - 1);
            justifiedText.append(justifiedLine.toString()).append(System.lineSeparator());
            justifiedLine = new StringBuilder();
        }
    }
    return justifiedText.toString();
}

Test:

String text = "Long sentence with spaces, and punctuation too. And supercalifragilisticexpialidocious words. No carriage returns, tho -- since it would seem weird to count the words in a new line as part of the previous paragraph's length.";
System.out.println(justify(text, 15));

Output:

Long sentence
with spaces,
and punctuation
too. And
supercalifragilisticexpialidocious
words. No
carriage
returns, tho --
since it would
seem weird to
count the words
in a new line
as part of the
previous
paragraph's
length.

It takes into account words that are longer than the set limit, so it doesn't skip them (unlike the regex version, which just stops processing when it finds supercalifragilisticexpialidocious).

PS: The comment about all input words being expected to be shorter than the set limit was made after I came up with this solution ;)
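
If you prefer the regex from the other answer but don't want it to stop at an overlong word, one possible tweak is to add a fallback alternative that takes a whole word when nothing shorter fits. This is my own variation, not something either answer proposes; `LenientSplitter` is a hypothetical name, and trimming the input is an added assumption to avoid a whitespace-only chunk at the end.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, not part of any library.
class LenientSplitter {

    static List<String> split(String text, int maxLength) {
        // First alternative: up to maxLength characters that end at a word
        // boundary (followed by whitespace or the end of the input).
        // Second alternative, tried only when the first one fails: a single
        // whole word, even though it is longer than maxLength.
        Pattern p = Pattern.compile(
                "\\G\\s*(.{1," + maxLength + "}(?=\\s|$)|\\S+)", Pattern.DOTALL);
        Matcher m = p.matcher(text.trim());
        List<String> chunks = new ArrayList<>();
        while (m.find()) {
            chunks.add(m.group(1));
        }
        return chunks;
    }
}
```

For example, `LenientSplitter.split("aa supercalifragilisticexpialidocious bb", 5)` gives `[aa, supercalifragilisticexpialidocious, bb]`. The overlong word comes back as one chunk that exceeds the limit, so a caller with a hard maximum would still have to handle that case separately.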

walen
  • This solution works also if the whole string does not contain white spaces. It just splits the string. – Amio.io Sep 29 '16 at 13:50