Is it even possible with regex and split()?
Yes, but it's quite messy to fully implement what you describe. Note in particular that your specification characterizes the substrings you want to accept, whereas split() works in terms of matching delimiters between substrings.
You can nevertheless do this kind of thing by using zero-width lookaround assertions for your delimiter patterns, but that turns out to require a long and ugly regex to accurately implement your specific requirements. More than anything else, the 5-character window makes a real mess of things. Java regex does support the special \G boundary matcher, which matches at the end of the previous match (or at the start of the input if there has been no match yet), and that is what makes the job possible.
Here's the best pattern I've come up with:
(?x) (?<= \\G\\w{5} )
| (?<= \\G .{4} \\W )
| (?<= \\G .{3} \\W ) (?= \\w )
| (?<= \\G .{2} \\W ) (?= \\w{2} | \\w\\z )
| (?<= \\G . \\W ) (?= \\w{3} | \\w{1,2}\\z )
| (?<= \\G \\W ) (?= \\w{4} | \\w{1,3}\\z )
(Note that comments mode is enabled to make whitespace in the pattern insignificant.)
There is one alternative for the delimiter implicitly following five word characters since the last match, and one for each possible token length for tokens ending in a non-word character. I observe in passing that in the latter cases the delimiter does not necessarily fall at the first non-word / word boundary, nor necessarily at such a boundary at all, but rather after the last non-word character among the five under consideration. Additionally, it is not necessary for a delimiter to be present after the last token.
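For reference, here is a minimal sketch of how the pattern above might be plugged into a split. The class name WindowSplitDemo, the helper splitTokens, and the sample input are all made up for illustration; only the pattern itself comes from this answer. The backslashes are doubled because the pattern is written as Java string literals.

```java
import java.util.regex.Pattern;

public class WindowSplitDemo {
    // The delimiter pattern from this answer, one alternative per line.
    // (?x) enables comments mode, so the spaces inside the literals are ignored.
    private static final Pattern DELIMITER = Pattern.compile(
          "(?x) (?<= \\G\\w{5} )"
        + "| (?<= \\G .{4} \\W )"
        + "| (?<= \\G .{3} \\W ) (?= \\w )"
        + "| (?<= \\G .{2} \\W ) (?= \\w{2} | \\w\\z )"
        + "| (?<= \\G . \\W ) (?= \\w{3} | \\w{1,2}\\z )"
        + "| (?<= \\G \\W ) (?= \\w{4} | \\w{1,3}\\z )");

    // Hypothetical helper: splits the input at the zero-width delimiters.
    static String[] splitTokens(String input) {
        return DELIMITER.split(input);
    }

    public static void main(String[] args) {
        // Arbitrary sample input, chosen only to exercise several alternatives.
        String sample = "Hello, world! abcdefgh";
        for (String token : splitTokens(sample)) {
            System.out.println("[" + token + "]");
        }
    }
}
```

Because every alternative is zero-width, no characters are consumed by the split, so concatenating the tokens gives back the original input. Note that . does not match line terminators by default, so input containing line breaks would presumably need the DOTALL flag as well.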