
Expanding on this answer, which uses the regex (?<=\\G.{" + count + "}), I would also like to modify the expression so that it does not split words in the middle.

Example:

String string = "Hello I would like to split this string preserving these words";

If I split on 10 characters, the result currently looks like this:

[Hello I wo, uld like t, o split th, is string , preserving, these wor, ds]
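
For reference, here is a minimal sketch of the fixed-width split that gives the result above. The class name FixedWidthSplit and the method name split are made up for illustration; the regex is the one quoted in the question, and the sketch assumes a standard desktop JVM where \G is accepted inside a lookbehind.

```java
import java.util.Arrays;

class FixedWidthSplit {
    // Splits text after every `count` characters, ignoring word boundaries.
    static String[] split(String text, int count) {
        // \G anchors at the end of the previous match (or the start of input),
        // so the lookbehind succeeds exactly every `count` characters.
        return text.split("(?<=\\G.{" + count + "})");
    }

    public static void main(String[] args) {
        String string = "Hello I would like to split this string preserving these words";
        System.out.println(Arrays.toString(split(string, 10)));
    }
}
```

Note that the pieces keep their whitespace, so a piece may start or end with a space.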

Question:

Is this even possible using only regex, or would a lexer or some other string manipulation be needed?

UPDATE

This is what I want to use it on:

 + -------------------------------------------JVM Information------------------------------------------ + 
 | sun.boot.class.path : C:\Program Files\Java\jdk1.6.0_33\jre\lib\resources.jar;C:\Program Files\Java\ | 
 |                       jdk1.6.0_33\jre\lib\rt.jar;C:\Program Files\Java\jdk1.6.0_33\jre\lib\sunrsasig | 
 |                       n.jar;C:\Program Files\Java\jdk1.6.0_33\jre\lib\jsse.jar;C:\Program Files\Java | 
 |                       \jdk1.6.0_33\jre\lib\jce.jar;C:\Program Files\Java\jdk1.6.0_33\jre\lib\charset | 
 |                       s.jar;C:\Program Files\Java\jdk1.6.0_33\jre\lib\modules\jdk.boot.jar;C:\Progra | 
 |                       m Files\Java\jdk1.6.0_33\jre\classes                                           | 
 + ---------------------------------------------------------------------------------------------------- + 

The box surrounding it is as wide as the character limit minus the key width; however, the wrapped value does not look good. This example is also not the only use case: I use that box for multiple types of information.

epoch
  • Can you edit this to become a self-contained question? (Keep the link, though) – Thilo Sep 06 '12 at 08:54
  • I would use a simple lexer. It might be slightly longer but it would be easier to understand. ;) – Peter Lawrey Sep 06 '12 at 08:55
  • @PeterLawrey, thanks, I will start working on that, unless someone comes up with some magical regex ;) – epoch Sep 06 '12 at 09:03
  • In my experience, regular expression can get you pretty far, but it cannot do everything. You could probably produce a regex for splitting words OR a regex for splitting every n characters, but there is no way to combine these two regular expressions in any way other than "or". My advice is to split by words and generate a method which selects multiple words based on # of characters provided by the user. – Neil Sep 06 '12 at 09:08
  • So what is your desired result (example please)? Should the "split" be shorter or longer in case of a word being present? Or not split at all in that case and try next 10 chars? I think all of this is possible with regex. – Qtax Sep 06 '12 at 14:10

3 Answers


I have looked at this problem and none of those replies actually convinced me! Here is my version. It is very likely that it can be improved.

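public static String[] splitPreservingWords(String text, int length) {
    // Take up to `length` characters followed by whitespace or end of input,
    // end the line there, then split the result on the inserted newlines.
    return text.replaceAll("(?:\\s*)(.{1,"+ length +"})(?:\\s+|\\s*$)", "$1\n").split("\n");
}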
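
A quick usage sketch of the regex above, assuming the example string from the question. The class name RegexWrapDemo and the method name wrap are hypothetical; only the regex is taken from the answer.

```java
import java.util.Arrays;

class RegexWrapDemo {
    // Same regex as in the answer: take up to `length` characters that are
    // followed by whitespace or the end of input, then break the line there.
    static String[] wrap(String text, int length) {
        return text
            .replaceAll("(?:\\s*)(.{1," + length + "})(?:\\s+|\\s*$)", "$1\n")
            .split("\n");
    }

    public static void main(String[] args) {
        String string = "Hello I would like to split this string preserving these words";
        // Prints: [Hello I, would like, to split, this, string, preserving, these, words]
        System.out.println(Arrays.toString(wrap(string, 10)));
    }
}
```

One caveat: a word longer than `length` is not broken, so its line exceeds the limit. For example, wrap("abcdefghijklmno pq", 10) yields [abcdefghijklmno, pq]. Text that already contains newlines would also be split at those newlines.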
Tk421

No regex, but it seems to work:

// needs java.util.ArrayList and java.util.List;
// `string` is the text to split, `n` is the maximum part length
List<String> parts = new ArrayList<String>();
// stop as soon as the remainder fits, so the last part is not split needlessly
while (string.length() > n) {
    // look for a space at or to the left of the n-th character
    int index = string.lastIndexOf(" ", n);
    if (index == -1) {
        // no space to the left (very long word) -> next space to the right
        // change this to 'index = n' to break words in this case
        index = string.indexOf(" ", n);
    }
    if (index == -1) {
        break;
    }
    parts.add(string.substring(0, index));
    string = string.substring(index + 1);
}
parts.add(string);

This first looks for a space at or to the left of the n-th character; if there is one, the string is split there. Otherwise, it looks for the next space to the right, so that part ends up longer than n. Alternatively, you could break the word in this case.

tobias_k
  • This does not take into consideration word breaks which are not space, such as newline, period, colon, semicolon, etc. At that point, better to use regular expression to find word breaks and the rest of your algorithm to add it to the list. – Neil Sep 06 '12 at 09:19
  • @tobiask, the problem with this is that my `n` is a hard-limit, the string cannot be longer than `n` – epoch Sep 06 '12 at 09:20
  • Extended the code, but now that I see your example, it may be better to search for `'\'` instead of `' '`, or to use a regex for this part, as Neil points out. – tobias_k Sep 06 '12 at 09:35
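
The two points raised in these comments (a hard limit, and break characters other than a space) could be combined roughly as follows. This is only a sketch under assumptions: the class HardWrap, its wrap method and the separators parameter are invented for illustration and are not part of the answer. Separators are kept at the end of each line, so joining the lines reproduces the input exactly.

```java
import java.util.ArrayList;
import java.util.List;

class HardWrap {
    // Breaks `text` into lines of at most `limit` characters (limit >= 1).
    // A line ends right after the last separator that fits in the window;
    // if no separator fits, the text is cut at the limit.
    static List<String> wrap(String text, int limit, String separators) {
        List<String> lines = new ArrayList<String>();
        int start = 0;
        while (text.length() - start > limit) {
            int end = start + limit; // exclusive upper bound of this line
            int cut = -1;
            // scan backwards for the last separator inside the window
            for (int i = end - 1; i > start; i--) {
                if (separators.indexOf(text.charAt(i)) >= 0) {
                    cut = i + 1; // keep the separator on this line
                    break;
                }
            }
            if (cut == -1) {
                cut = end; // no separator found: hard break to honour the limit
            }
            lines.add(text.substring(start, cut));
            start = cut;
        }
        lines.add(text.substring(start)); // remainder already fits
        return lines;
    }

    public static void main(String[] args) {
        String path = "C:\\Program Files\\Java\\jdk1.6.0_33\\jre\\lib\\rt.jar;"
                    + "C:\\Program Files\\Java\\jdk1.6.0_33\\jre\\lib\\jsse.jar";
        // break after ';', '\' or ' ' with a hard limit of 30 characters
        for (String line : wrap(path, 30, ";\\ ")) {
            System.out.println(line);
        }
    }
}
```

For a boxed layout like the one in the question, each returned line would still need to be padded to the box width.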

"not split words in the middle" does not define what should happen in case of "not splitting".

Given the split length being 10 and the string:

Hello I would like to split this string preserving these words

If you want to split right after the word that reaches the limit, the result would be this list:

Hello I would, like to split, this string, preserving, these words

You can accomplish all kinds of tricky "splits" by using plain matching.

Simply match all occurrences of this expression:

(?s)\G.{10,}?\b

(Using (?s) to turn on the DOTALL flag.)

In Perl it's as simple as @array = $str =~ /\G.{10,}?\b/gs, but Java seems to lack a quick function to return all matches, so you'd probably have to use a matcher and push the results on to an array/list.
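
A minimal sketch of that matcher loop in Java, assuming the expression above. The class name MatchAll, the method matchAll and the min parameter are hypothetical. Two details are additions of this sketch rather than part of the expression: each match is trimmed, because every match after the first starts with the space that follows the previous word, and a tail shorter than min characters, which the expression does not match, is appended separately.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class MatchAll {
    // Collects every match of (?s)\G.{min,}?\b, i.e. at least `min` characters
    // extended lazily up to the next word boundary.
    static List<String> matchAll(String text, int min) {
        Pattern pattern = Pattern.compile("(?s)\\G.{" + min + ",}?\\b");
        Matcher matcher = pattern.matcher(text);
        List<String> parts = new ArrayList<String>();
        int end = 0;
        while (matcher.find()) {
            parts.add(matcher.group().trim()); // drop the leading space
            end = matcher.end();
        }
        // whatever is left is shorter than `min` and was not matched
        String rest = text.substring(end).trim();
        if (!rest.isEmpty()) {
            parts.add(rest);
        }
        return parts;
    }

    public static void main(String[] args) {
        String string = "Hello I would like to split this string preserving these words";
        // Prints: [Hello I would, like to split, this string, preserving, these words]
        System.out.println(matchAll(string, 10));
    }
}
```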

Qtax