
I am trying to detect missing groups from my Java regex, but it seems like there isn't a way to do it.

        Pattern pattern = Pattern.compile("(from) ([\\d]*)( to )([\\d]*)");
        Matcher match = pattern.matcher(input);
        if (match.find()) {
            // match
        }

This is my current regex structure, for a command: from START_DAY to END_DAY. It is capable of detecting when the full regex is met, but I want to do some validation, e.g.

Detect if the START_DAY is missing, or the END_DAY is missing.

Is there a way to detect missing groups?

The correct full command is: from START_DAY to END_DAY (e.g. from 10 to 20). If a user enters a wrong command, for example from 10, I want to tell them that END_DAY is missing.

Thanks

Iva l
  • Do you need to make these groups optional? Try `"from(?: (\\d+)(?: to (\\d+))?)?"` – Wiktor Stribiżew Jan 18 '21 at 12:17
  • The correct full command is: from START_DAY to END_DAY, but if a user enters a wrong command for example: from START_DAY, I want to feedback to them that END_DAY is missing – Iva l Jan 18 '21 at 12:18
  • Is this somewhat related to this question regarding HTML parsing by regex? Interesting read but the upshot is that regex is not suited to the task. https://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags – James Webb Jan 18 '21 at 12:22
  • My opinion is that you're over-engineering the solution. Being able to receive *arbitrary input* and form an *optimal* error message is going to be a messy problem... Why not just display a static message, like: "Expected format: **from START_DAY to END_DAY**", and let the user use common sense to see how their input doesn't match that? – Tom Lord Jan 18 '21 at 12:25
  • @TomLord this is part of my school work! Maybe I am thinking too much. My friends have simply split the string via spaces and work around that. I thought regex would work, but it seems impossible. If all else fails, I would just do it in another manner! – Iva l Jan 18 '21 at 12:28
  • If the scope of this error message can be limited to merely checking three possible error messages for one specific input format, then you could do it via an explicit `if ... else if ... else if ...`. But to do this *in general*, I wouldn't recommend trying to over-engineer the error messages; a straightforward "error: wrong format" is probably sufficient. – Tom Lord Jan 18 '21 at 12:31

1 Answer


Yes, you can detect missing groups. They return as null.

That is not what is happening with your actual regexp, though. A 'missing group', in regular-expression terms, arises in something like `foo(bar(baz))?`: the entire `bar(baz)` part is optional because of the question mark, and if that part isn't in the string, then the groups inside the optional part are missing groups, which the Matcher object returns as null.
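To make the null behaviour concrete, here is a minimal sketch using the `foo(bar(baz))?` pattern from above. The class name `OptionalGroupDemo` and the helper `groupOf` are made up for illustration; only `Pattern`, `Matcher.matches()` and `Matcher.group(int)` come from the JDK.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class OptionalGroupDemo {
    // Group 1 is "bar(baz)" and group 2 is "baz"; both sit inside the optional part.
    private static final Pattern PATTERN = Pattern.compile("foo(bar(baz))?");

    // Hypothetical helper: returns the text captured by the given group.
    // Returns null if the input does not match at all, or if the group
    // did not take part in the match.
    static String groupOf(String input, int group) {
        Matcher m = PATTERN.matcher(input);
        return m.matches() ? m.group(group) : null;
    }

    public static void main(String[] args) {
        System.out.println(groupOf("foobarbaz", 2)); // optional part present
        System.out.println(groupOf("foo", 2));       // optional part absent, so the group is null
    }
}
```

With the input `foo` the pattern still matches, but groups 1 and 2 never participate, so `group(1)` and `group(2)` are both null.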

However, you don't have any missing groups, and none are possible with the regexp in your question. "" (the empty string) is a correct match for the regexp `[\d]*`. The empty string is, after all, '0 or more digits', and 0 digits is a valid interpretation of '0 or more'. Thus, you are not missing a group there: it's there, as an empty string.

Which you could detect if you wanted to: `match.group(2).equals("")` would be true.
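As a rough sketch of that empty-string check, assuming the regexp from the question is used unchanged: the class name `EmptyGroupDemo`, the method `describe` and its return strings are invented for this example, and `isEmpty()` is used as an equivalent of `.equals("")`.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class EmptyGroupDemo {
    // The regexp from the question, with the backslashes escaped for a Java string literal.
    private static final Pattern COMMAND = Pattern.compile("(from) ([\\d]*)( to )([\\d]*)");

    // Hypothetical helper: reports which day group matched as an empty string.
    static String describe(String input) {
        Matcher m = COMMAND.matcher(input);
        if (!m.find()) {
            return "no match"; // e.g. the literal " to " is absent
        }
        if (m.group(2).isEmpty()) {
            return "START_DAY is empty";
        }
        if (m.group(4).isEmpty()) {
            return "END_DAY is empty";
        }
        return "ok";
    }

    public static void main(String[] args) {
        System.out.println(describe("from 10 to 20"));
        System.out.println(describe("from 10 to "));
        System.out.println(describe("from 10"));
    }
}
```

Note the limitation: `from 10` does not match at all, because the literal ` to ` is still required, so this check only catches inputs such as `from 10 to ` where the day itself is left out.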

Had the input not started with `from`, for example, the regexp would simply fail to match. If you are trying to write a tool that intelligently tells you which part(s) of the input string are missing, oh boy, that is an incredibly complex story that involves parsers: the tools that e.g. javac and other language compilers use to parse text files. It's incredibly complicated, and there are multiple libraries and algorithms to go about it.

For something as simple as this you could presumably hand-roll it: write regexps that match expectable but wrong inputs and, if one of those matches, print out the corresponding error string. For more complex grammars this soon grows into an exponential mess and we're back to: yes, it is far more complicated than it sounds, which is why academics have written many papers on the topic. There are LL(k) parsers, LALR, packrat, negative-memoizing-only packrat, and more. They have different properties; some are fast but give relatively bad syntax error info and can't handle all grammars. Others could be fed input that will cause them to take years to parse (or gigabytes of memory even for a relatively small input), and still others are really hard to actually use as a programmer (as in, to write the grammar). I don't know about your level of skill, but this isn't something I'd advise for a first or even second year Java programmer.
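A hand-rolled version for this one command might look like the sketch below. Everything in it is an assumption made for illustration: the class name `CommandValidator`, the four patterns, the choice to trim the input and allow any whitespace between tokens, and the wording of the messages. It only covers the handful of wrong inputs it explicitly lists.

```java
import java.util.regex.Pattern;

public class CommandValidator {
    // One pattern for the correct command, plus one per expected mistake.
    private static final Pattern FULL = Pattern.compile("from\\s+\\d+\\s+to\\s+\\d+");
    private static final Pattern MISSING_END = Pattern.compile("from\\s+\\d+(\\s+to)?");
    private static final Pattern MISSING_START = Pattern.compile("from\\s+to\\s+\\d+");
    private static final Pattern MISSING_BOTH = Pattern.compile("from(\\s+to)?");

    // Returns "OK" for a well-formed command, otherwise an error message.
    static String validate(String input) {
        String s = input.trim();
        if (FULL.matcher(s).matches()) {
            return "OK";
        }
        if (MISSING_END.matcher(s).matches()) {
            return "END_DAY is missing";
        }
        if (MISSING_START.matcher(s).matches()) {
            return "START_DAY is missing";
        }
        if (MISSING_BOTH.matcher(s).matches()) {
            return "START_DAY and END_DAY are missing";
        }
        // Anything else falls back to a static format hint.
        return "Expected format: from START_DAY to END_DAY";
    }

    public static void main(String[] args) {
        System.out.println(validate("from 10 to 20"));
        System.out.println(validate("from 10"));
        System.out.println(validate("from to 20"));
        System.out.println(validate("hello"));
    }
}
```

Each new kind of mistake needs its own pattern and branch, which is why this approach stops scaling once the grammar grows beyond a single short command.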

rzwitserloot