Ignore certain beginnings of words in a regular expression

Question

I'm trying to parse all the links in a message.

My Java code looks like this:

Pattern URLPATTERN = Pattern.compile(
    "([--:\\w?@%&+~#=]*\\.[a-z]{2,4}/{0,2})((?:[?&](?:\\w+)=(?:\\w+))+|[--:\\w?@%&+~#=]+)?",
    Pattern.CASE_INSENSITIVE | Pattern.MULTILINE | Pattern.DOTALL);

Matcher matcher = Patterns.URLPATTERN.matcher(message);
ArrayList<int[]> links = new ArrayList<>();
while (matcher.find())
    links.add(new int[] {matcher.start(1), matcher.end()});
[...]

The problem is that the links sometimes start with a colour code that looks like this: [&§]{1}[a-z0-9]{1}

An example could be: Please use Google: §ehttps://google.com, and don't ask me.

The regular expression, which I found somewhere on the internet, matches ehttps://google.com, but it should match only https://google.com

Now how can I change the regular expression above to exclude the following pattern but still match the link that follows just after the color-code?

[&§]{1}[a-z0-9]{1}

Does this answer your question? [What is the best regular expression to check if a string is a valid URL?](https://stackoverflow.com/questions/161738/what-is-the-best-regular-expression-to-check-if-a-string-is-a-valid-url) — markspace, Oct 11 '20 at 16:22
@markspace Of course it does not, it is not a URL validation problem. — Wiktor Stribiżew, Oct 11 '20 at 16:27

Wiktor Stribiżew · Accepted Answer · 2020-10-11T16:26:35.623

You can add a (?:[&§][a-z0-9])? pattern (matching an optional sequence of a & or § and then an ASCII letter or digit) at the beginning of your regex:

Pattern URLPATTERN = Pattern.compile(
    "(?:[&§][a-z0-9])?([--:\\w?@%&+~#=]*\\.[a-z]{2,4}/{0,2})((?:[?&]\\w+=\\w+)+|[--:\\w?@%&+~#=]+)?", Pattern.CASE_INSENSITIVE);

See the regex demo.

When the regex finds §ehttps://google.com, the §e is matched by the optional non-capturing group (?:[&§][a-z0-9])?, which is why it is "excluded" from the Group 1 value.
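To see the effect end to end, here is a minimal sketch that runs the adjusted pattern on the example message from the question. The class name ColorCodeLinkDemo and the helper findLinks are made up for illustration; the helper slices from matcher.start(1) to matcher.end(), the same range the question's loop records. The § sign is written as \u00A7 only so that the source file does not depend on its encoding.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ColorCodeLinkDemo {
    // Same regex as above, split over several lines; \u00A7 is the section sign.
    static final Pattern URL_PATTERN = Pattern.compile(
        "(?:[&\u00A7][a-z0-9])?"                        // optional colour code, not captured
        + "([--:\\w?@%&+~#=]*\\.[a-z]{2,4}/{0,2})"      // Group 1: scheme, host, TLD
        + "((?:[?&]\\w+=\\w+)+|[--:\\w?@%&+~#=]+)?",    // Group 2: optional query or path
        Pattern.CASE_INSENSITIVE);

    // Hypothetical helper: returns each link from the start of Group 1
    // to the end of the whole match, as in the question's loop.
    static List<String> findLinks(String message) {
        List<String> links = new ArrayList<>();
        Matcher matcher = URL_PATTERN.matcher(message);
        while (matcher.find()) {
            links.add(message.substring(matcher.start(1), matcher.end()));
        }
        return links;
    }

    public static void main(String[] args) {
        String message = "Please use Google: \u00A7ehttps://google.com, and don't ask me.";
        System.out.println(findLinks(message)); // prints [https://google.com]
    }
}
```

Because the colour code is consumed outside Group 1, matcher.start(1) already points at the h of https, so the rest of the question's code should not need to change.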

There is no need to use Pattern.MULTILINE | Pattern.DOTALL with your regex, since there is no . and no ^/$ in the pattern.
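If you want to convince yourself that the two flags make no difference here, a quick check along the following lines can help. It is only a sketch: the class FlagCheck, the helper groupOneMatches, and the two-line test message are invented for this illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class FlagCheck {
    // Same regex as in the answer; \u00A7 is the section sign.
    static final String REGEX =
        "(?:[&\u00A7][a-z0-9])?([--:\\w?@%&+~#=]*\\.[a-z]{2,4}/{0,2})"
        + "((?:[?&]\\w+=\\w+)+|[--:\\w?@%&+~#=]+)?";

    // Hypothetical helper: collects Group 1 of every match under the given flags.
    static List<String> groupOneMatches(String message, int flags) {
        List<String> found = new ArrayList<>();
        Matcher matcher = Pattern.compile(REGEX, flags).matcher(message);
        while (matcher.find()) {
            found.add(matcher.group(1));
        }
        return found;
    }

    public static void main(String[] args) {
        // Made-up input that spans two lines.
        String message = "first \u00A7ehttps://google.com\nsecond https://example.org";
        List<String> minimal = groupOneMatches(message, Pattern.CASE_INSENSITIVE);
        List<String> allFlags = groupOneMatches(message,
            Pattern.CASE_INSENSITIVE | Pattern.MULTILINE | Pattern.DOTALL);
        System.out.println(minimal.equals(allFlags)); // prints true
    }
}
```

MULTILINE only changes what ^ and $ match, and DOTALL only changes what . matches, so a pattern containing none of them behaves the same with or without those flags.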
