Repeated capturing groups usually capture only the last iteration. This holds for Kotlin as well as Java, since neither language exposes an API that keeps a stack of the values captured by each group.
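To see the limitation concretely, here is a minimal sketch that turns the repeated group into a capturing one and reads group 1 after a full match. The helper name lastIterationOnly is hypothetical and only serves the illustration.

```kotlin
// Hypothetical helper: returns the value of group 1 after a full match, or null if there is no match.
fun lastIterationOnly(text: String): String? {
    // Group 1 is inside a repeated group, so every iteration overwrites its value.
    val regex = """A([+-]\d{1,2}[YMD])*""".toRegex()
    return regex.matchEntire(text)?.groupValues?.get(1)
}

fun main() {
    // Only the last iteration survives; "-3M" is lost.
    println(lastIterationOnly("A-3M+2D")) // => +2D
}
```

The earlier iteration, -3M, cannot be recovered from the match result, which is why the parts have to be extracted in a separate step.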
As a workaround, first validate the whole string against the pattern it should match, and then either extract the parts or split the string into them.
For the current scenario, you may use:
val text = "A-3M+2D"
if (text.matches("""A(?:[+-]\d{1,2}[YMD])*""".toRegex())) {
    val results = text.split("(?=[-+])".toRegex())
    println(results)
}
// => [A, -3M, +2D]
Here, text.matches("""A(?:[+-]\d{1,2}[YMD])*""".toRegex()) makes sure the whole string matches A followed by 0 or more occurrences of + or -, then 1 or 2 digits, then Y, M or D. Then .split("(?=[-+])".toRegex()) splits the text at each empty position right before a - or +.
Pattern details
^ - (implicit in .matches()) start of string
A - an A substring
(?: - start of a non-capturing group:
  [+-] - a character class matching + or -
  \d{1,2} - one to two digits
  [YMD] - a character class that matches Y or M or D
)* - end of the non-capturing group, repeat 0 or more times (due to the * quantifier)
\z - (implicit in .matches()) end of string.
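The implicit anchoring matters: .matches() only succeeds when the pattern covers the entire string, while a search-style call such as containsMatchIn accepts a match anywhere. A small sketch of the difference, with illustrative names (offsetPattern, isWholeMatch, hasPartialMatch) that are not part of the original code:

```kotlin
// Same validation pattern as above; the name is illustrative.
val offsetPattern = """A(?:[+-]\d{1,2}[YMD])*""".toRegex()

// Behaves as if the pattern were wrapped in ^ ... \z.
fun isWholeMatch(text: String): Boolean = text.matches(offsetPattern)

// Unanchored search: succeeds if any substring matches.
fun hasPartialMatch(text: String): Boolean = offsetPattern.containsMatchIn(text)

fun main() {
    println(isWholeMatch("A-3M+2D"))      // => true
    println(isWholeMatch("xA-3M+2D"))     // => false (extra char at the start)
    println(isWholeMatch("A-3M+2Dx"))     // => false (extra char at the end)
    println(hasPartialMatch("xA-3M+2Dx")) // => true (finds A-3M+2D inside)
}
```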
When splitting, we only need to find the locations before - or +, hence we use a positive lookahead, (?=[-+]), that matches a position immediately followed by + or -. It is a non-consuming pattern: the + or - it checks for is not added to the match value, so it stays in the resulting parts.
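The effect of the lookahead is easiest to see next to a consuming pattern. In this sketch the two function names are made up for the comparison; the only difference between them is whether the sign is consumed by the split pattern.

```kotlin
// Lookahead: splits at the position before each sign, so the sign stays in the part.
fun splitKeepingSigns(text: String): List<String> = text.split("(?=[-+])".toRegex())

// Consuming character class: the sign itself is the delimiter and is dropped.
fun splitDroppingSigns(text: String): List<String> = text.split("[-+]".toRegex())

fun main() {
    println(splitKeepingSigns("A-3M+2D"))  // => [A, -3M, +2D]
    println(splitDroppingSigns("A-3M+2D")) // => [A, 3M, 2D]
}
```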
Another approach with a single regex
You may also use a \G-based regex that checks the string format once at the start of the string and only starts matching consecutive substrings if that check succeeds:
val regex = """(?:\G(?!^)[+-]|^(?=A(?:[+-]\d{1,2}[YMD])*$))[^-+]+""".toRegex()
println(regex.findAll("A-3M+2D").map{it.value}.toList())
// => [A, -3M, +2D]
Details
(?:\G(?!^)[+-]|^(?=A(?:[+-]\d{1,2}[YMD])*$)) - either the end of the previous successful match followed by + or - (see \G(?!^)[+-]), or (|) the start of the string, provided it is followed by A and then 0 or more occurrences of + or -, 1 or 2 digits and then Y, M or D up to the end of the string (see ^(?=A(?:[+-]\d{1,2}[YMD])*$))
[^-+]+ - 1 or more chars other than - and +. We need not be too careful here since the lookahead did the heavy lifting at the start of the string.
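Because the format check lives in the lookahead at the start of the string, an invalid input should produce no matches at all rather than a partial list. The sketch below wraps the same regex in a helper (extractParts and chainedRegex are hypothetical names) to show both outcomes.

```kotlin
// Same \G-based regex as above; the name is illustrative.
val chainedRegex = """(?:\G(?!^)[+-]|^(?=A(?:[+-]\d{1,2}[YMD])*$))[^-+]+""".toRegex()

// Hypothetical helper: returns the parts, or an empty list when the format check fails.
fun extractParts(text: String): List<String> =
    chainedRegex.findAll(text).map { it.value }.toList()

fun main() {
    println(extractParts("A-3M+2D")) // => [A, -3M, +2D]
    // X is not a valid unit, so the lookahead fails and nothing is matched.
    println(extractParts("A-3M+2X")) // => []
}
```

An empty result can therefore be treated as "the string did not have the expected format", which makes a separate validation call unnecessary with this variant.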