Regex match between delimiters

Question

I am trying to figure out a way to match a domain name in a config file. My file may look like this:

<abc="xyz">abc.com</abc>

I want to match abc.com and replace it with placeholder text. My current Java solution replaces the exact domain in the text by calling StrName.replace("abc.com","random"). However, it also turns abc.commmmm into randommmmm, which I don't want. So I tried a regex, but my regex [.><](abc.com)[.<>] also selects the > and < in the string, which I don't want either. I also realized that the text could be something like this:

<abc="xyz">text.abc.com.net</abc>

I still want to replace abc.com, so my regex won't work. How do I correct this?

Related: https://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags?rq=1 — Isaac, Jun 19 '17 at 23:45
Are you looking for [word boundaries](https://docs.oracle.com/javase/tutorial/essential/regex/bounds.html)? As in `\b(abc\.com)\b`? (Edit: escape that dot!) — Ken Y-N, Jun 19 '17 at 23:47
Try regex [`(abc.com(?!m))`](https://regex101.com/r/TiGUKL/1) — SomeDude, Jun 19 '17 at 23:49
`(?<=[.><])abc.com(?=[.><])` I'm unable to test this, but give it a try. It uses positive lookbehind and positive lookahead. — Isaac, Jun 19 '17 at 23:53
@svasa I think he might be after positive look(ahead|behind) rather than negative — Isaac, Jun 19 '17 at 23:54
@Maxsteel you might be able to piece something together from these http://www.regular-expressions.info/lookaround.html I'm not sure of the support in java though — Isaac, Jun 19 '17 at 23:57
@MaxSteel, as I understand your requirement you don't want `abc.commm` but `abc.com`, that regex matches only if there are no extra `m`s , uses [`negative look ahead`](http://www.rexegg.com/regex-lookarounds.html) — SomeDude, Jun 20 '17 at 00:03
@svasa I believe the extra `m`s were an example of something not to match, not the only thing to avoid matching, which is why negative look(ahead | behind) won't work. I suspect abc.combbbb would also be a match @Maxsteel would want to avoid — gwcoderguy, Jun 20 '17 at 00:06
@svasa that wouldn't work as you'd get matches for `abc.com!!!` and `mmmmabc.com` and `abcdcom` — gwcoderguy, Jun 20 '17 at 00:21
All of the comments about "but it'll also match" and "but it won't match" on this question and on gwcoderguy's answer are an indication that the _specification is insufficient_ – the problem statement is unclear. Are these statements correct? (a) match the literal string `abc.com` (b) only when it appears between `>` and `<` (c) other text can optionally also appear between `>` and `<` (d) but only letters and dots (e) what else? You can't write a regex unless you can define _exactly_ what you do _and don't_ want to match. — Stephen P, Jun 20 '17 at 00:27

gwcoderguy · Answer 1 · 2017-06-20T00:24:53.230

You may want to consider using positive look-ahead and positive look-behind in your regex. See: http://www.regular-expressions.info/lookaround.html

Positive look-ahead is written like THING1(?=THING2) and it means find THING1 followed by THING2.

Positive look-behind is written like (?<=THING1)THING2 and it means find THING2 preceded by THING1.

In both of these cases the THING inside the look-around is only checked, not consumed, so it is not part of the match. For your first example you could do something like: (?<=>)abc\.com(?=<)

meaning abc.com preceded by > from (?<=>) and followed by < from (?=<).
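
A minimal Java sketch of that first pattern, to show that the look-arounds check for > and < without including them in the match. The class and method names are only illustrative.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LookaroundDemo {
    // Pattern from above: abc.com preceded by '>' and followed by '<'.
    static final Pattern DOMAIN = Pattern.compile("(?<=>)abc\\.com(?=<)");

    // Returns the matched text, or null when there is no match.
    static String firstMatch(String input) {
        Matcher m = DOMAIN.matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        String line = "<abc=\"xyz\">abc.com</abc>";
        // The match is only the domain; '>' and '<' are not consumed.
        System.out.println(firstMatch(line));
        // Because the delimiters are not part of the match, they survive the replacement.
        System.out.println(DOMAIN.matcher(line).replaceAll("random"));
    }
}
```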

If you are also looking to replace abc.com between periods as in text.abc.com.... you can try:

(?<=[.><])abc\.com(?=[.><])

meaning abc.com preceded by <,> or . from (?<=[.><]) and followed by <,> or . from (?=[.><]).

This will give you <abc="xyz">random</abc> and <abc="xyz">text.random.net</abc> for your two examples, as well as no match for <abc="xyz">abc.commmm</abc>
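
As a rough sketch, the same thing in Java could look like the following. The class name DelimitedReplace and the helper mask are made up for the example; only the regex and the three inputs come from the text above.

```java
import java.util.regex.Pattern;

public class DelimitedReplace {
    // abc.com preceded by one of . > < and followed by one of . > <
    static final Pattern DOMAIN = Pattern.compile("(?<=[.><])abc\\.com(?=[.><])");

    // Replaces every delimited occurrence of abc.com with the placeholder.
    static String mask(String input) {
        return DOMAIN.matcher(input).replaceAll("random");
    }

    public static void main(String[] args) {
        System.out.println(mask("<abc=\"xyz\">abc.com</abc>"));          // <abc="xyz">random</abc>
        System.out.println(mask("<abc=\"xyz\">text.abc.com.net</abc>")); // <abc="xyz">text.random.net</abc>
        System.out.println(mask("<abc=\"xyz\">abc.commmm</abc>"));       // unchanged
    }
}
```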

NOTE: escape the period by using \. instead of . as otherwise you will match abcdcom, because . matches any character. See: http://docs.oracle.com/javase/6/docs/api/java/util/regex/Pattern.html
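
To see the difference the escape makes, here is a small sketch. The helper names are hypothetical; Pattern.quote is shown as an alternative to escaping by hand when the whole string is meant literally.

```java
import java.util.regex.Pattern;

public class EscapeDot {
    // Unescaped: '.' is a metacharacter that stands for any single character.
    static boolean matchesUnescaped(String s) {
        return Pattern.matches("abc.com", s);
    }

    // Escaped: "\\." in Java source is the regex \. and matches only a literal dot.
    static boolean matchesEscaped(String s) {
        return Pattern.matches("abc\\.com", s);
    }

    // Pattern.quote turns the whole string into a literal, so no manual escaping is needed.
    static boolean matchesQuoted(String s) {
        return Pattern.matches(Pattern.quote("abc.com"), s);
    }

    public static void main(String[] args) {
        System.out.println(matchesUnescaped("abcdcom")); // true
        System.out.println(matchesEscaped("abcdcom"));   // false
        System.out.println(matchesEscaped("abc.com"));   // true
        System.out.println(matchesQuoted("abcdcom"));    // false
    }
}
```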

Thanks, but test.abc.com.common should also give a match, and this wouldn't, right? — Maxsteel, Jun 20 '17 at 00:02
So you want to replace abc.com when it is preceded by a < or a . ? Otherwise, I'm not sure of the distinction between text.abc.com and abc.commm — gwcoderguy, Jun 20 '17 at 00:03

score 0 · Answer 2 · answered Jun 20 '17 at 18:02

Besides using look-arounds, you can do this with standard capturing groups.

First, define what you're looking for...

Find abc.com when it appears within tags (enclosed within > and <) and optionally has additional dotted prefixes and/or suffixes.
Replace abc.com with random when found in these circumstances.

This looks to me like a multi-level domain with two or more segments, which abc.com must be a part of.

What are the parts we're looking for?

The fixed string abc.com – the re for that is abc\.com escaping the dot so it's a literal instead of "any character".
Optional domain parts preceding "abc.com" – letters followed by a dot [a-z]+\.
... but there can be zero or more of them, so ([a-z]+\.)*
Optional domain parts following "abc.com" – a dot followed by letters, zero or more times (\.[a-z]+)*
All of that enclosed within the end of a start tag > and the start of an end tag < ... so >something<

Putting that all together we get >([a-z]+\.)*abc\.com(\.[a-z]+)*< which needs to be escaped to be a Java string ">([a-z]+\\.)*abc\\.com(\\.[a-z]+)*<"

Now, since we are matching and consuming the > and < we'll need them in the replacement string, and we're capturing the prefix and suffix so we need to put those in the replacement also using the capturing groups 1 and 2, giving >$1random$2<
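
Put into Java, that could look roughly like this. The class name CaptureReplace and the helper mask are placeholders for the example; the regex and the replacement string are the ones built up above.

```java
import java.util.regex.Pattern;

public class CaptureReplace {
    // '>' + optional dotted prefix + abc.com + optional dotted suffix + '<'
    static final Pattern DOMAIN =
            Pattern.compile(">([a-z]+\\.)*abc\\.com(\\.[a-z]+)*<");

    static String mask(String input) {
        // The consumed '>' and '<' are written back explicitly. $1 and $2 restore
        // the captured prefix and suffix; a group that did not take part in the
        // match contributes nothing to the replacement.
        return DOMAIN.matcher(input).replaceAll(">$1random$2<");
    }

    public static void main(String[] args) {
        System.out.println(mask("<abc=\"xyz\">abc.com</abc>"));
        System.out.println(mask("<abc=\"xyz\">text.abc.com.net</abc>"));
        System.out.println(mask("<abc=\"xyz\">abc.commmmm</abc>"));
    }
}
```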

I put this re in at regexplanet.com http://fiddle.re/fxhfxn using these test strings and producing the replacement strings:

<abc="xyz">abc.com</abc>           =>  <abc="xyz">random</abc>
<abc="xyz">abc.commmmm</abc>       =>  <abc="xyz">abc.commmmm</abc>
<abc="xyz">text.abc.com.net</abc>  =>  <abc="xyz">text.random.net</abc>
[abc="xyz"]text.abc.com.net[/abc]  =>  [abc="xyz"]text.abc.com.net[/abc]    
<abc="xyz">abc-com</abc>           =>  <abc="xyz">abc-com</abc>
<abc="xyz">abc.com.fr</abc>        =>  <abc="xyz">random.fr</abc>
<abc="xyz">www.abc.com</abc>       =>  <abc="xyz">www.random</abc>
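
One case the test strings above do not cover is more than one prefix or suffix segment. In Java, a repeated capturing group such as ([a-z]+\.)* keeps only its last repetition, so $1 would hold just the segment next to abc.com. The sketch below compares the regex as given with a possible variation that wraps the repetition in a non-capturing group. The variation, the class name and the sample input are assumptions for illustration, not part of what was tested on regexplanet.

```java
import java.util.regex.Pattern;

public class MultiSegment {
    // Regex as given above: each repeated group keeps only its last repetition.
    static final Pattern ORIGINAL =
            Pattern.compile(">([a-z]+\\.)*abc\\.com(\\.[a-z]+)*<");

    // Variation (an assumption, not the tested regex): the repetition sits in a
    // non-capturing group (?:...) and the outer group captures the whole run.
    static final Pattern VARIANT =
            Pattern.compile(">((?:[a-z]+\\.)*)abc\\.com((?:\\.[a-z]+)*)<");

    static String mask(Pattern p, String input) {
        return p.matcher(input).replaceAll(">$1random$2<");
    }

    public static void main(String[] args) {
        String line = "<abc=\"xyz\">mail.eu.abc.com.co.uk</abc>"; // hypothetical input
        System.out.println(mask(ORIGINAL, line)); // <abc="xyz">eu.random.uk</abc>
        System.out.println(mask(VARIANT, line));  // <abc="xyz">mail.eu.random.co.uk</abc>
    }
}
```

For inputs with at most one prefix and one suffix segment, such as the test strings above, both patterns give the same result.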
