Regex recursion captured string

Question

I have a problem with a regex that has to capture a substring that has already been captured...

I have this regex:

(?<domain>\w+\.\w+)($|\/|\.)

And I want to capture every subdomain recursively. For example, in this string:

test1.test2.abc.def

This expression captures test1.test2 and abc.def, but I need to capture test1.test2, test2.abc, and abc.def.

Do you know if there is any option to do this recursively?

Thanks!

What regex flavor are you using? Some support recursive match. — Schwern, Feb 20 '20 at 07:35
So you're saying that it's possible for a regex to match text that does not belong to that text in the first place, @Schwern? — Themelis, Feb 20 '20 at 07:38
Note that domain names include `-` and exclude `_`. `[a-zA-Z0-9-]` is a better approximation. See this answer for a proper regex. https://stackoverflow.com/questions/60269926/validate-format-of-subdomain/60271196#60271196 — Schwern, Feb 20 '20 at 07:39
@Themelis I'm thinking [`(?R)`](https://www.rexegg.com/regex-recursion.html) might be useful. Not sure what you're referring to. — Schwern, Feb 20 '20 at 07:46
Have you had time to check my suggestion? Others' suggestions? Did anything work for you? — Wiktor Stribiżew, Feb 21 '20 at 11:25

JvdV · Answer 1 · 2020-02-20T07:59:04.583

score 3

Maybe the following:

(\.|^)(?=(\w+\.\w+))

Go with capturing group 2
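As a rough illustration of how group 2 yields the overlapping pairs, here is a minimal Ruby sketch. The flavor is an assumption (the question does not name one), and `overlapping_pairs` is a hypothetical helper name. Because the pair sits inside a lookahead, the match itself consumes only the dot (or nothing at the start), so the next match can begin inside the previous pair.

```ruby
# Hypothetical helper: returns what capturing group 2 saw at each match.
# Group 1 consumes a dot (or matches at the start); group 2 lives inside a
# lookahead, so its text is captured but not consumed.
# Note: in Ruby, ^ matches at the start of every line, which is fine for a
# single-line host name.
def overlapping_pairs(host)
  host.scan(/(\.|^)(?=(\w+\.\w+))/).map { |_separator, pair| pair }
end

p overlapping_pairs("test1.test2.abc.def")
# prints ["test1.test2", "test2.abc", "abc.def"]
```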

edited Feb 20 '20 at 07:59

answered Feb 20 '20 at 07:44

JvdV


score 1 · Answer 2 · answered Feb 20 '20 at 08:01

~~You can use a positive look ahead to capture the next group.~~

/(\w+)\.(?=(\w+))/g

Demonstration.

Edit: JvdV's regex is more correct.

Note that \w+ will fail to match domains like regex-tester.com and will match the invalid regex_tester.com. [a-zA-Z0-9-]+ is closer to correct. See this answer for a complete regex.
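To make the difference concrete, here is a small Ruby check, offered as a sketch only. The helper names `word_pair?` and `label_pair?` are made up for this example, and the character class is still only an approximation of a valid label (it would accept a leading hyphen, for instance).

```ruby
# \w is [a-zA-Z0-9_] in Ruby: it accepts "_" and rejects "-".
def word_pair?(text)
  text.match?(/\A\w+\.\w+\z/)
end

# Approximate host-name label: letters, digits and hyphens.
def label_pair?(text)
  text.match?(/\A[a-zA-Z0-9-]+\.[a-zA-Z0-9-]+\z/)
end

p word_pair?("regex-tester.com")   # prints false
p word_pair?("regex_tester.com")   # prints true
p label_pair?("regex-tester.com")  # prints true
p label_pair?("regex_tester.com")  # prints false
```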

It's simpler and more robust to do this by splitting on . and iterating through the pieces in pairs. For example, in Ruby...

# Split into labels, then print each adjacent pair joined by a dot.
"test1.test2.abc.def".split(".").each_cons(2) do |pair|
  puts pair.join(".")
end

test1.test2
test2.abc
abc.def

score 0 · Accepted Answer · answered Feb 20 '20 at 08:08

You may use a well-known technique to extract overlapping matches, but you can't rely on \b boundaries as they can match between a non-word / word char and word / non-word char. You need unambiguous word boundaries for left and right hand contexts.

Use

(?=(?<!\w)(?<domain>\w+\.\w+)(?!\w))

See the regex demo. Details:

- (?= - a positive lookahead that tests each location in the string and captures the part of the string to the right of it
- (?<!\w) - a left-hand side word boundary
- (?<domain>\w+\.\w+) - Group "domain": 1+ word chars, . and 1+ word chars
- (?!\w) - a right-hand side word boundary
- ) - end of the outer lookahead.
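The breakdown above can be tried out with a short Ruby sketch. Ruby is an assumption here, chosen because it accepts the (?<domain>...) syntax as written, and `domain_pairs` is a hypothetical helper name. The whole pattern is zero-width, so `scan` steps through the string one position at a time and returns the "domain" group wherever the lookahead succeeds.

```ruby
# Hypothetical helper: collects the "domain" group from every position where
# the outer lookahead succeeds. The match itself is empty, so matches overlap.
def domain_pairs(text)
  text.scan(/(?=(?<!\w)(?<domain>\w+\.\w+)(?!\w))/).flatten
end

p domain_pairs("test1.test2.abc.def")
# prints ["test1.test2", "test2.abc", "abc.def"]
```

The (?<!\w) check is what stops a match from starting in the middle of a label such as "est1.test2".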

Another approach is to use dots as word delimiters. Then use

(?=(?<![^.])(?<domain>[^.]+\.[^.]+)(?![^.]))

See this regex demo. Adjust as you see fit.
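For comparison, a minimal Ruby sketch of the dot-delimited variant, again assuming Ruby as the flavor and using a hypothetical helper name, `dot_delimited_pairs`. Since [^.] accepts any character other than a dot, hyphenated labels are kept whole, but so would slashes or spaces, so this suits input that is already a bare host name.

```ruby
# Hypothetical helper: a label is any run of non-dot characters, bounded on
# both sides by a dot or by the start/end of the string.
def dot_delimited_pairs(host)
  host.scan(/(?=(?<![^.])(?<domain>[^.]+\.[^.]+)(?![^.]))/).flatten
end

p dot_delimited_pairs("regex-tester.example.com")
# prints ["regex-tester.example", "example.com"]
```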
