Multiple matches in single java regexp

Question

Is it possible to match the following in a single regular expression to get the first word, and then a list of the numbers?

this 10 12 3 44 5 66 7 8    # should return "this", "10", "12", ...
another 1 2 3               # should return "another", "1", "2", "3"

EDIT1: My actual data is not this simple, the digits are actually more complex patterns, but for illustration purposes, I've reduced the problem to simple digits, so I do require a regex answer.

The list of numbers varies in length from line to line, but every item matches a simple pattern.

The following only matches "this" and "10":

([\p{Alpha}]+ )(\d+ ?)+?

Dropping the final ? matches "this" and "8".

I had thought that the final group (\d+ ?)+ would do the digit matching multiple times, but it doesn't and I can't find the syntax to do it, if possible.

I can do it in multiple passes, by only searching for the name and latter numbers separately, but was wondering if it's possible in a single expression? (And if not, is there a reason?)
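On the "is there a reason?" part: in java.util.regex a capturing group holds a single value, so when a quantifier repeats the group, each iteration overwrites the previous capture and only the last one can be read afterwards. A minimal sketch of that behaviour, using the expression above with the trailing ? dropped (the class and method names are made up for illustration):

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class RepeatedGroupDemo {
    // Expression from the question, without the trailing lazy '?'.
    static final Pattern GREEDY = Pattern.compile("([\\p{Alpha}]+ )(\\d+ ?)+");

    // Returns whatever group 2 holds once the whole line has matched.
    static String lastCapture(String line) {
        Matcher m = GREEDY.matcher(line);
        if (!m.matches()) {
            throw new IllegalArgumentException("no match: " + line);
        }
        // Group 2 was entered eight times, but only the final iteration is kept.
        return m.group(2);
    }

    public static void main(String[] args) {
        System.out.println(lastCapture("this 10 12 3 44 5 66 7 8")); // prints 8
    }
}
```

Collecting every number therefore seems to need repeated find() calls (or a split) rather than a single repeated group.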

EDIT2: As I mentioned in some of the comments, this was a question in Advent of Code (Day 7, 2020). I was looking for the cleanest solution (who doesn't love a bit of polishing?)

Here's the final solution (Kotlin) I used. I spent too long trying to do it in one regex, so I posted this question.

val bagExtractor = Regex("""^([\p{Alpha} ]+) bags contain""")
val rulesExtractor = Regex("""([\d]+) ([\p{Alpha} ]+) bag""")

// bagRule is a line from the input
val bag = bagExtractor.find(bagRule)?.destructured!!.let { (n) -> Bag(name = n) }
val contains = rulesExtractor.findAll(bagRule).map { it.destructured.let { (num, bagName) -> Contain(num = num.toInt(), bag = Bag(bagName)) } }.toList()
Rule(bag = bag, contains = contains)

Despite now knowing it can be done in 1 line, I haven't implemented it, as I think it's cleaner in 2.

Looking at this, can't you simply split on spaces? And if not, why? — JvdV, Dec 07 '20 at 17:18
this is a very simplified version of the actual input, where the final numbers are more complex patterns (actually of the pattern " ") that exhibit the same behaviour, only matching the first or last expression, never the full list of items. — Mark Fisher, Dec 07 '20 at 17:24
Yes, use `String pat = "(\\G(?!^)|\\b\\p{L}+\\b)\\s+(\\d+)";`. Group 1 will only be matched when the initial word is matched. You need to use it with `matcher.find` and some extra code logic. — Wiktor Stribiżew, Dec 07 '20 at 18:12
This is wizardry! I tested this at https://www.freeformatter.com/java-regex-tester.html#ad-output and as you say, the initial group is slightly askew, but otherwise is pretty good. the matches give "other 1", "2", "3". — Mark Fisher, Dec 07 '20 at 23:38
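The "extra code logic" mentioned for the \G pattern in the comments could look roughly like the sketch below. It assumes that group 1 is an empty string (not the word) on every match after the first, because the \G branch matches zero characters; the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class ContinuedMatchDemo {
    // Pattern from the comment: group 1 is either the leading word or an empty
    // \G anchor (continue where the previous match ended); group 2 is a number.
    static final Pattern PAT = Pattern.compile("(\\G(?!^)|\\b\\p{L}+\\b)\\s+(\\d+)");

    static List<String> extract(String line) {
        List<String> out = new ArrayList<>();
        Matcher m = PAT.matcher(line);
        while (m.find()) {
            String word = m.group(1);
            // Only the first match carries the word; later ones matched via \G.
            if (word != null && !word.isEmpty()) {
                out.add(word);
            }
            out.add(m.group(2));
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(extract("this 10 12 3 44 5 66 7 8"));
        // prints [this, 10, 12, 3, 44, 5, 66, 7, 8]
    }
}
```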

Arvind Kumar Avinash · Accepted Answer · 2020-12-07T17:54:17.443

1

I think what you are looking for can be achieved by splitting the string on \s+ unless I am missing something.

import java.util.Arrays;

public class Main {
    public static void main(String[] args) {
        String str = "this 10 12 3 44 5 66 7 8";
        String[] parts = str.split("\\s+");
        System.out.println(Arrays.toString(parts));
    }
}

Output:

[this, 10, 12, 3, 44, 5, 66, 7, 8]

If you want to select just the alphabetical text and the integer text from the string, you can do it as

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class Main {
    public static void main(String[] args) {
        String str = "this 10 12 3 44 5 66 7 8";
        Matcher matcher = Pattern.compile("(\\b\\p{Alpha}+\\b)|(\\b\\d+\\b)").matcher(str);
        while (matcher.find()) {
            System.out.println(matcher.group());
        }
    }
}

Output:

this
10
12
3
44
5
66
7
8

or as

import java.util.List;
import java.util.regex.MatchResult;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

public class Main {
    public static void main(String[] args) {
        String str = "this 10 12 3 44 5 66 7 8";

        List<String> list = Pattern.compile("(\\b\\p{Alpha}+\\b)|(\\b\\d+\\b)")
                            .matcher(str)
                            .results()
                            .map(MatchResult::group)
                            .collect(Collectors.toList());

        System.out.println(list);
    }
}

Output:

[this, 10, 12, 3, 44, 5, 66, 7, 8]

edited Dec 07 '20 at 17:54

answered Dec 07 '20 at 17:24

Arvind Kumar Avinash

I should have commented faster :) No, this isn't possible with the real data, the "digits" in my actual data are more complex structures made up of multiple words, but do follow a pattern I can match – Mark Fisher Dec 07 '20 at 17:26
@MarkFisher - Can you please post here an actual sample (after hiding PII, if any)? – Arvind Kumar Avinash Dec 07 '20 at 17:34
The sample data given should be good enough to test on. I have a solution which is to split the example regex I gave into 2 parts and scan the input twice with each regex. That works fine, I just don't understand why the combination of them doesn't work. – Mark Fisher Dec 07 '20 at 17:38
@MarkFisher - I've posted an update. If the input and output are not as per your expectation, feel free to comment with an example input and the expected output. – Arvind Kumar Avinash Dec 07 '20 at 17:58
That works! Nice answer. I went with `([\p{Alpha} ]+) bags contain|(\d+) ([\p{Alpha} ]+) bag` on the actual input data which is matching everything I need on the line. Cheers. – Mark Fisher Dec 07 '20 at 18:44
@MarkFisher *The sample data given should be good enough to test on* Apparently not. You said you wanted to match on a single word followed by numbers. This will match on words and numbers in any order which is actually easier since the pattern may alternate. – WJS Dec 07 '20 at 20:31
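For reference, the alternation pattern quoted in the comment above can be driven with a single find() loop in Java. This is only a sketch under assumptions: the Rule holder and the parse method are invented here for illustration, and which alternative matched is decided by checking whether group 1 is null.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class BagRuleDemo {
    // Pattern from the comment: outer bag | count + inner bag.
    static final Pattern RULE = Pattern.compile(
            "([\\p{Alpha} ]+) bags contain|(\\d+) ([\\p{Alpha} ]+) bag");

    // Hypothetical holder for one parsed line.
    static final class Rule {
        String bag;                                                   // outer bag name
        final Map<String, Integer> contains = new LinkedHashMap<>(); // inner bag -> count
    }

    static Rule parse(String line) {
        Rule rule = new Rule();
        Matcher m = RULE.matcher(line);
        while (m.find()) {
            if (m.group(1) != null) {
                rule.bag = m.group(1);      // first alternative matched
            } else {
                rule.contains.put(m.group(3), Integer.parseInt(m.group(2)));
            }
        }
        return rule;
    }

    public static void main(String[] args) {
        Rule r = parse("posh crimson bags contain 2 mirrored tan bags, 1 faded red bag, 1 striped gray bag.");
        System.out.println(r.bag + " -> " + r.contains);
        // prints posh crimson -> {mirrored tan=2, faded red=1, striped gray=1}
    }
}
```

A line such as "faded blue bags contain no other bags." yields the bag name and an empty map, since the second alternative needs a digit.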

score 0 · Answer 2 · answered Dec 07 '20 at 17:27

No. The notion of "find me all of a certain regexp" is just not done with incrementing groups. You're really asking for why regexp is what it is? That's... an epic thesis that delves into some ancient computing history and a lot of Larry Wall (author of Perl, which is more or less where regexps came from) interviews, that seems a bit beyond the scope of SO. They work that way because regexps work that way, and those work that way because they've worked that way for decades and changing them would mess with people's expectations; let's not go any deeper than that.

You can do this with scanners instead:

Scanner s = new Scanner("this 10 12 3 44 5 66 7 8");
assertEquals("this", s.next());
assertEquals(10, s.nextInt());
// etc

or even:

Scanner s = new Scanner("this 10 12 3 44 5 66 7 8");
// next(Pattern) returns the next token only if the whole token matches the pattern
assertEquals("this", s.next(Pattern.compile("[\\p{Alpha}]+")));
assertEquals(10, s.nextInt());

s = new Scanner("--00invalid-- 10 12 3 44 5 66 7 8");
// the line below will throw an InputMismatchException
s.next(Pattern.compile("[\\p{Alpha}]+"));

NB: Scanners tokenize: they split the input into a sequence of token, separator, token, separator, and so on, then toss the separators and give you the tokens. .next(Pattern) does not mean "keep scanning until you hit something that matches". It means: grab the next token; if it matches this regexp, return it; otherwise throw an InputMismatchException.

So, the real magic is in making the scanner tokenize as you want. This is done with .useDelimiter(), which is also regexp based. Some fancy footwork with positive lookahead and co can get you far, but it's not infinitely powerful. You didn't expand on the actual structure of your input so I can't say if it'll suffice for your needs.
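As a rough illustration of the delimiter idea, here is a sketch that assumes input shaped like the Advent of Code bag rules the question's Kotlin regexes target. The delimiter expression is my own guess at something workable, not part of the original answer, and it uses plain alternation rather than lookahead.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;

final class ScannerDelimiterDemo {
    // Assumed delimiter: either " bags contain " or " bag"/" bags"
    // followed by ',' or '.' and any trailing whitespace.
    static final String DELIMITER = " bags contain | bags?[,.]\\s*";

    static List<String> tokens(String line) {
        List<String> out = new ArrayList<>();
        try (Scanner s = new Scanner(line).useDelimiter(DELIMITER)) {
            while (s.hasNext()) {
                out.add(s.next());   // everything between two delimiters
            }
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(tokens("light red bags contain 1 bright white bag, 2 muted yellow bags."));
        // prints [light red, 1 bright white, 2 muted yellow]
    }
}
```

Each "count name" token would still need a second step (for example another Scanner or a split on the first space) to separate the number from the bag name.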

An example of the actual input is `posh crimson bags contain 2 mirrored tan bags, 1 faded red bag, 1 striped gray bag.` which some may recognise from AOC 2020 day 7 today. I got an answer using 2 regex: `^([\p{Alpha} ]+) bags contain` and `([\d]+) ([\p{Alpha} ]+) bag` but wanted a single expression to work if possible matching the beginning and then multiple values on the end of the line. — Mark Fisher, Dec 07 '20 at 17:33

WJS · Answer 3 · 2020-12-07T20:12:04.910

0

You said you had to use a regex, but how about a hybrid solution? Use the regex to verify the format and then split the values on spaces or the delimiter of your choosing. I also return the value in an Optional so you can check its availability before use.

String[] data = { "this 10 12 3 44 5 66 7 8",
        "Bad Data 5 5 5",
        "another 1 2 3" };

for (String text : data) {
    Optional<List<String>> op = parseText(text);
    if (op.isPresent()) {
        System.out.println(op.get());
    }
}

Prints

[this, 10, 12, 3, 44, 5, 66, 7, 8]
[another, 1, 2, 3]

static String pattern = "([a-zA-Z]+)(\\s+\\d+)+";
    
public static Optional<List<String>> parseText(String text) {
    if (text.matches(pattern)) {
        return Optional.of(Arrays.stream(text.split("\\s+"))
                .collect(Collectors.toList()));
    }
    return Optional.empty();
}

edited Dec 07 '20 at 20:12

answered Dec 07 '20 at 18:16

WJS

Thank you for your answer. I was trying not to bog down the question with so much detail that the idea would get lost. The question really was about parsing multiple entries in the input data with regex rather than those specific values, and in retrospect I can understand why some (very good) answers leaned more towards splitting on spaces and similar. It would have helped had I said the input is well formed, so I didn't have to worry about ensuring it matches first before parsing. Tips for me next time I ask a question! – Mark Fisher Dec 07 '20 at 23:32
I understand -- no problems. It wasn't the splitting on spaces that was the issue (at least for me). It was trying to capture a non-repeating group (alphas) followed by some quantity of numbers. But the important thing is that you got an answer you can use. – WJS Dec 08 '20 at 02:10

score 0 · Answer 4 · answered Dec 07 '20 at 18:41

Assuming you are talking about this: adventofcode where the inputs are the rules

light red bags contain 1 bright white bag, 2 muted yellow bags.
dark orange bags contain 3 bright white bags, 4 muted yellow bags.
bright white bags contain 1 shiny gold bag.
muted yellow bags contain 2 shiny gold bags, 9 faded blue bags.
shiny gold bags contain 1 dark olive bag, 2 vibrant plum bags.
dark olive bags contain 3 faded blue bags, 4 dotted black bags.
vibrant plum bags contain 5 faded blue bags, 6 dotted black bags.
faded blue bags contain no other bags.
dotted black bags contain no other bags.

Why search for a complicated regular expression when you can easily split on the word contain or on a comma?

String str1 = "light red bags contain 1 bright white bag, 2 muted yellow bags.";
String str2 = "dotted black bags contain no other bags.";
String[] split1 = str1.split("\\scontain\\s|,");
String[] split2 = str2.split("\\scontain\\s|,");

System.out.println(Arrays.toString(split1));
System.out.println(Arrays.toString(split2));

//[light red bags, 1 bright white bag,  2 muted yellow bags.]
//[dotted black bags, no other bags.]

Yes, that's the puzzle for today. I solved it fine, I was just trying to find a single regex to cater for entire line, hence question. I'm actually using Kotlin, but the regex is same between the two. I had used split on space and taking 4 words at a time in my first solution but it was hideously long and convoluted, then refactored to regex removing half the code. I'll post my own solution in the question as it doesn't format well in comments. Thanks for your answer! — Mark Fisher, Dec 07 '20 at 18:55
