Split word by character count or last index of non-word character

Question

I have a string like this:

aaaaaas#aa##aa

I want to use split() and a regex to implement this algorithm:

get the first 5 characters
if they contain a non-word character, cut after the last non-word character (keeping that character in the piece)
if they don't contain any non-word character, cut after all 5 characters
repeat from the last cut until the string ends

The result for this example should be:

aaaaa
as#
aa##
aa

Is it even possible with regex and split()? This

.*([\W]+)\W

matches everything up to the last non-word character (in the example that would be aaaaaas#aa##), but how do I limit each piece to at most 5 characters, split, and continue from the end of the previous match?

https://regex101.com/r/xA9kG3/14

Does input `a#a#a#a#` get split to `a#`, `a#`, `a#`, `a#`, or does it get split to `a#a#`, `a#a#`? Bullet 1 says to *"get 5 characters"*, i.e. `a#a#a`, then bullet 2 says to *"cut to **last** non-word character"*, i.e. `a#a#`. But did you mean for it to cut after `a#`? — Andreas, Feb 27 '17 at 17:33
OP comment under my (now deleted answer) `aa#####aa should output aaa##, ##aa, because last word is shorter then 5 characters, sory, I didn't mentioned it. So I should add bullet: if splitted word is last 5 characters or shorter then 5 characters (so they are last characters in string) return it` — Pshemo, Feb 27 '17 at 18:50
To be honest your question looks like [X/Y problem](http://meta.stackexchange.com/questions/66377/what-is-the-xy-problem). You gave us some steps but we still don't know *what is the point*? Maybe you are looking for something like: http://stackoverflow.com/questions/25853393/split-a-string-in-java-into-equal-length-substrings-while-maintaining-word-bound? — Pshemo, Feb 27 '17 at 19:32

John Bollinger · Answer 1 · 2017-02-27T19:01:55.210

Is it even possible with regex and split()?

Yes, but it's quite messy to fully implement what you describe. Note in particular that your specification characterizes the substrings you want to accept, whereas split() works in terms of matching delimiters between substrings.

You can nevertheless do this kind of thing by using zero-width lookaround assertions for your delimiter patterns, but that turns out to require a long and ugly regex to accurately implement your specific requirements. More than anything else, the 5-character window makes a real mess of things. Java regex does support the boundary matcher \G, which matches at the end of the previous match (or at the start of the input if there has been no match yet), and that is what makes the job possible.

Here's the best pattern I've come up with (backslashes are doubled, as they would be inside a Java string literal):

(?x) (?<= \\G\\w{5} )
   | (?<= \\G  .{4} \\W )
   | (?<= \\G  .{3} \\W ) (?= \\w )
   | (?<= \\G  .{2} \\W ) (?= \\w{2} | \\w\\z )
   | (?<= \\G  .    \\W ) (?= \\w{3} | \\w{1,2}\\z )
   | (?<= \\G       \\W ) (?= \\w{4} | \\w{1,3}\\z )

(Note that comments mode is enabled to make whitespace in the pattern insignificant.)
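As a minimal sketch of how the pattern might be passed to String.split(), something like the following should work. The class name WindowSplitRegex, the constant DELIMITER, and the helper splitByWindow are hypothetical names chosen for illustration; only the pattern itself comes from above.

```java
import java.util.Arrays;

public class WindowSplitRegex {
    // The delimiter pattern from above, one alternative per line.
    // (?x) enables comments mode, so the spaces inside the pattern are ignored.
    static final String DELIMITER =
          "(?x) (?<= \\G\\w{5} )"
        + "   | (?<= \\G  .{4} \\W )"
        + "   | (?<= \\G  .{3} \\W ) (?= \\w )"
        + "   | (?<= \\G  .{2} \\W ) (?= \\w{2} | \\w\\z )"
        + "   | (?<= \\G  .    \\W ) (?= \\w{3} | \\w{1,2}\\z )"
        + "   | (?<= \\G       \\W ) (?= \\w{4} | \\w{1,3}\\z )";

    // Hypothetical helper: every delimiter match is zero-width,
    // so no characters are consumed and the pieces concatenate back to the input.
    static String[] splitByWindow(String input) {
        return input.split(DELIMITER);
    }

    public static void main(String[] args) {
        String[] pieces = splitByWindow("aaaaaas#aa##aa");
        System.out.println(Arrays.toString(pieces)); // [aaaaa, as#, aa##, aa]
    }
}
```

Because the pattern uses comments mode, a literal `#` must not be added to it unescaped; it would start a comment that runs to the end of the pattern line.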

There is one alternative for the delimiter implicitly following five word characters since the last match, and one for each possible token length for tokens ending in a non-word character. I observe in passing that the delimiter does not necessarily fall at the first non-word / word boundary in such cases nor necessarily at such a boundary at all, but rather after the last non-word character of the five at a time under consideration. Additionally, it is not necessary for a delimiter to be present after the last token.
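For comparison, the rule "cut after the last non-word character of the five under consideration" can also be written without a regex. This is a different technique from split(), shown only as a sketch to make the window logic concrete; the names WindowSplitLoop, isWordChar, and splitByWindow are hypothetical, and isWordChar assumes the default ASCII meaning of \w in java.util.regex.

```java
import java.util.ArrayList;
import java.util.List;

public class WindowSplitLoop {
    // Assumption: same character set as the default \w, i.e. [a-zA-Z_0-9].
    static boolean isWordChar(char c) {
        return (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z')
                || (c >= '0' && c <= '9') || c == '_';
    }

    static List<String> splitByWindow(String input, int window) {
        List<String> tokens = new ArrayList<>();
        int start = 0;
        while (start < input.length()) {
            // Look at up to `window` characters starting at the last cut.
            int end = Math.min(start + window, input.length());
            // Default: no non-word character in the window, so take all of it.
            int cut = end;
            // Otherwise cut just after the last non-word character in the window.
            for (int i = end - 1; i >= start; i--) {
                if (!isWordChar(input.charAt(i))) {
                    cut = i + 1;
                    break;
                }
            }
            tokens.add(input.substring(start, cut));
            start = cut;
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(splitByWindow("aaaaaas#aa##aa", 5)); // [aaaaa, as#, aa##, aa]
    }
}
```

On the question's example this produces the same four pieces as the expected output; it treats a final window shorter than five characters the same way as any other window.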
