Extracting pairs of words using String.split()

Question

Given:

String input = "one two three four five six seven";

Is there a regex that works with String.split() to grab (up to) two words at a time, such that:

String[] pairs = input.split("some regex");
System.out.println(Arrays.toString(pairs));

results in this:

[one two, three four, five six, seven]

This question is about the split regex. It is not about "finding a work-around" or other "making it work in another way" solutions.

It's a puzzle... but it interested me enough to ask it, because look-behinds must be bounded in length, so it seems like a non-trivial problem. — Bohemian, May 10 '13 at 15:42
Java's look-behind is one of the strangest beast. In .NET, you can freely look-behind for variable length. In PCRE, you can only look-behind for fixed length. In Java, due to bug/feature in implementation of `+` and `*`, you *sometimes* can match variable length pattern: http://stackoverflow.com/questions/1536915/regex-look-behind-without-obvious-maximum-length-in-java — nhahtdh, May 12 '13 at 09:37

Pshemo · Accepted Answer · 2022-05-06T23:04:52.900

Currently (last tested on Java 17) it is possible to do this with split(), but don't use this approach in real code. It appears to rely on a bug: a look-behind in Java should have an obvious maximum length, yet this solution uses \w+, which doesn't respect that limitation and somehow still works. If it is a bug and it gets fixed in a later release, this solution will stop working.

Instead, use the Pattern and Matcher classes with a regex like \w+\s+\w+. Besides being safer, this avoids maintenance hell for whoever inherits the code (remember to "Always code as if the person who ends up maintaining your code is a violent psychopath who knows where you live").
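A minimal sketch of that Pattern/Matcher alternative follows. The class name PairExtractor and the method pairs are made up for illustration, and the optional second word (?:\s+\w+)? is an assumption added on top of the \w+\s+\w+ regex mentioned above, so that a trailing unpaired word such as "seven" is also returned.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of the original answer.
class PairExtractor {
    // One word, optionally followed by whitespace and a second word.
    // The optional group is an assumption so the last odd word is kept.
    private static final Pattern PAIR = Pattern.compile("\\w+(?:\\s+\\w+)?");

    static List<String> pairs(String input) {
        List<String> result = new ArrayList<>();
        Matcher m = PAIR.matcher(input);
        // Each find() consumes up to two words; the whitespace between
        // matches is simply skipped.
        while (m.find()) {
            result.add(m.group());
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(pairs("one two three four five six seven"));
        // prints [one two, three four, five six, seven]
    }
}
```

Unlike the split() trick, this matches the content rather than the separators, so it does not depend on how look-behind is implemented.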

Is this what you are looking for?
(You can replace \\w with \\S to include all non-space characters, but for this example I will keep \\w, since a regex with \\w\\s is easier to read than one with \\S\\s.)

String input = "one two three four five six seven";
String[] pairs = input.split("(?<!\\G\\w+)\\s");
System.out.println(Arrays.toString(pairs));

output:

[one two, three four, five six, seven]

\G matches at the end of the previous match, and (?<!regex) is a negative look-behind.

In split we are trying to

find spaces -> \\s
that are not preceded -> (?<!negativeLookBehind)
by some word -> \\w+
with the previously matched (space) -> \\G
before it -> \\G\\w+.

The only thing that confused me at first was how this would work for the first space, since we want that space to be ignored. The key fact is that at the start \\G matches the start of the String, like ^.

So before the first iteration, the regex in the negative look-behind effectively looks like (?<!^\\w+), and since the first space does have ^\\w+ before it, it can't be a match for split. The next space does not have this problem, so it is matched, and information about it (like its position in the input String) is stored in \\G and used later in the next negative look-behind.

So for the 3rd space, the regex checks whether the previously matched space \\G and a word \\w+ come directly before it. Since that test succeeds, the negative look-behind rejects it and this space is not matched. The 4th space doesn't have this problem, because the space before it is not the one stored in \\G (it has a different position in the input String).
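The walk-through above can be checked by running the same regex through a Matcher and recording where it matches, since split() finds its separators the same way. This is only a sketch: the class name SplitTrace and the method matchedSpaces are hypothetical, and the result relies on the look-behind behaviour described in this answer.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical tracing helper, not part of the original answer.
class SplitTrace {
    // Returns the index of every space that the split regex accepts
    // as a separator.
    static List<Integer> matchedSpaces(String input) {
        Matcher m = Pattern.compile("(?<!\\G\\w+)\\s").matcher(input);
        List<Integer> indexes = new ArrayList<>();
        while (m.find()) {
            indexes.add(m.start());
        }
        return indexes;
    }

    public static void main(String[] args) {
        // Spaces sit at indexes 3, 7, 13, 18, 23 and 27.
        // On a JDK where the regex behaves as described in this answer,
        // only every second one is reported: [7, 18, 27]
        System.out.println(matchedSpaces("one two three four five six seven"));
    }
}
```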

Also, if someone would like to split on, let's say, every 3rd space, you can use this form (based on @maybeWeCouldStealAVan's answer, which was deleted when I posted this fragment of the answer):

input.split("(?<=\\G\\w{1,100}\\s\\w{1,100}\\s\\w{1,100})\\s")

Instead of 100 you can use some bigger value that is at least the length of the longest word in the String.
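As a rough generalization of the bounded form, the regex can be assembled for any group size. The sketch below is an assumption-laden illustration: the class NthSpaceSplitter, the method splitEveryNthSpace and its parameters are invented names, and maxWordLength must be at least the length of the longest word, as noted above.

```java
import java.util.Arrays;
import java.util.Collections;

// Hypothetical helper, not part of the original answer.
class NthSpaceSplitter {
    // Splits on every n-th whitespace character by requiring exactly n
    // bounded-length words between the previous match (\G) and the separator.
    static String[] splitEveryNthSpace(String input, int n, int maxWordLength) {
        String word = "\\w{1," + maxWordLength + "}";
        // n copies of the word pattern joined by single whitespace matches.
        String group = String.join("\\s", Collections.nCopies(n, word));
        return input.split("(?<=\\G" + group + ")\\s");
    }

    public static void main(String[] args) {
        String input = "one two three four five six seven";
        System.out.println(Arrays.toString(splitEveryNthSpace(input, 3, 100)));
        // prints [one two three, four five six, seven]
    }
}
```

Because every quantifier is bounded, the look-behind has an obvious maximum length and does not rely on the questionable \w+ behaviour.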

I just noticed that we can also use + instead of {1,maxWordLength} if we want to split on every odd-numbered occurrence, like every 3rd, 5th or 7th, for example:

String data = "0,0,1,2,4,5,3,4,6,1,3,3,4,5,1,1";
String[] array = data.split("(?<=\\G\\d+,\\d+,\\d+,\\d+,\\d+),");//every 5th comma

You should probably change `\w+` to `\S+` in case the notional "words" aren't in fact words. Also, could you add a detailed description/explanation of why this works? It's a great regex, it would be good to make sure everyone understands it thoroughly too. — Bohemian, May 10 '13 at 21:29
As an incentive, if you add a good explanation within the next hour, I'll throw in a +50 bounty bonus! :) (actually, that's rubbish - I'm awarding the bounty anyway - you deserve it, because I for one learned something) — Bohemian, May 10 '13 at 22:03
@Bohemian Could you check if explanation in my updated answer is sufficient? — Pshemo, May 10 '13 at 22:12
I just noticed an important implication: using `\G` at the start of an otherwise unbounded look behind expression makes it considered as bounded. This may be a handy way to squeeze out boundedness from an unboundable expression instead of the ugly `{0,100}` workaround — Bohemian, Aug 20 '13 at 00:08
@dovnwoter Care to leave a comment? Is there something wrong with this answer? — Pshemo, Sep 11 '13 at 09:15
Some people are like that - it's just skill envy. BTW no prize for guessing who the bounty is for (finally!) :) — Bohemian, Sep 11 '13 at 11:40
During the bounty period, you got 14 upvotes - you'll make 100 soon and score a [gold badge](http://stackoverflow.com/help/badges/25/great-answer). Not bad - it pays to advertise! Now it occurs to me that posting a bounty gains more rep than the bounty costs. If you felt like it, you could similarly post a bounty to award [this answer](http://stackoverflow.com/questions/2290757/how-can-you-escape-the-character-in-javadoc/8463481#8463481) - I would really like a gold too :) — Bohemian, Sep 17 '13 at 00:35
That was lovely, nice alternative to doing the whole thing in regex +1 :) — zx81, Jun 10 '14 at 00:54

maybeWeCouldStealAVan · Answer 2 · 2013-05-10T17:52:58.043

9

This will work, but the maximum word length needs to be set in advance:

String input = "one two three four five six seven eight nine ten eleven";
String[] pairs = input.split("(?<=\\G\\S{1,30}\\s\\S{1,30})\\s");
System.out.println(Arrays.toString(pairs));

I like Pshemo's answer better, being shorter and usable on arbitrary word lengths, but this (as @Pshemo pointed out) has the advantage of being adaptable to groups of more than 2 words.

edited May 10 '13 at 17:52

answered May 10 '13 at 16:05

I'm giving you a +1, but it doesn't answer the question of having arbitrarily long words. At least you got something working though. – Bohemian May 10 '13 at 16:07
+1 for answer that can be adapted easily to any number of words that should be grouped. – Pshemo May 10 '13 at 19:41
I've tested both solutions in a similar problem of splitting an array into pairs and it was your regex which worked for me. – Ernani Mar 12 '19 at 00:30

score 0 · Answer 3 · answered Sep 12 '13 at 22:42

This worked for me: (\w+\s*){2}\K\s

a required word followed by an optional space (\w+\s*)
repeated two times {2}
ignore previously matched characters \K
the required space \s

alpha bravo

Alexey · Answer 4 · 2013-05-10T16:18:46.303

-1

You can try this:

[a-z]+\s[a-z]+

Updated:

([a-z]+\s[a-z]+)|[a-z]+

Updated:

 import java.util.regex.Pattern;

 String pattern = "([a-z]+\\s[a-z]+)|[a-z]+";
 String input = "one two three four five six seven";

 // Note: split() treats the pattern as the separator, not as the content.
 Pattern splitter = Pattern.compile(pattern);
 String[] results = splitter.split(input);

 for (String pair : results) {
  System.out.println("Output = \"" + pair + "\"");
 }

edited May 10 '13 at 16:18

answered May 10 '13 at 15:31

Will this grab the `seven` not matched with a 2nd word pair? – Walls May 10 '13 at 15:32
This does not answer the question. Your regex matches the target content, but `split()` requires a regex to match the *separators*. Your regex does not work (with `split()`) – Bohemian May 10 '13 at 15:41
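
To illustrate the point about separators, here is a small sketch of what split() returns when it is given the content-matching regex from this answer. The class name SeparatorVsContent and the method splitOnContentRegex are hypothetical.

```java
import java.util.Arrays;

// Hypothetical demonstration, not part of the original thread.
class SeparatorVsContent {
    // Uses the content-matching regex as a split() separator.
    static String[] splitOnContentRegex(String input) {
        return input.split("([a-z]+\\s[a-z]+)|[a-z]+");
    }

    public static void main(String[] args) {
        String[] parts = splitOnContentRegex("one two three four five six seven");
        // The words are consumed as separators, so only the gaps survive:
        // an empty leading string followed by three single-space strings.
        for (String part : parts) {
            System.out.println("Output = \"" + part + "\"");
        }
        System.out.println(parts.length + " elements");
    }
}
```

The word pairs themselves never appear in the result, which is why a split() regex has to describe the spaces to cut at instead.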
