I'm trying to parse all the links in a message.

My Java code looks like this:

Pattern URLPATTERN = Pattern.compile(
    "([--:\\w?@%&+~#=]*\\.[a-z]{2,4}/{0,2})((?:[?&](?:\\w+)=(?:\\w+))+|[--:\\w?@%&+~#=]+)?",
    Pattern.CASE_INSENSITIVE | Pattern.MULTILINE | Pattern.DOTALL);

Matcher matcher = Patterns.URLPATTERN.matcher(message);
ArrayList<int[]> links = new ArrayList<>();
while (matcher.find())
    links.add(new int[] {matcher.start(1), matcher.end()});
[...]

The problem is that the links are sometimes preceded by a color code that matches the following pattern: [&§]{1}[a-z0-9]{1}

An example could be: Please use Google: §ehttps://google.com, and don't ask me.

The regular expression above, which I found somewhere on the internet, matches ehttps://google.com, but it should only match https://google.com.
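For reference, the behaviour described above can be reproduced with a small self-contained program. This is only a sketch: the class name LinkRepro and the helper findLinks are made up for illustration, and § is written as \u00A7 so the result does not depend on the source file encoding.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical wrapper class, used only to make the snippet runnable.
class LinkRepro {
    // Pattern copied unchanged from the question.
    static final Pattern URLPATTERN = Pattern.compile(
        "([--:\\w?@%&+~#=]*\\.[a-z]{2,4}/{0,2})((?:[?&](?:\\w+)=(?:\\w+))+|[--:\\w?@%&+~#=]+)?",
        Pattern.CASE_INSENSITIVE | Pattern.MULTILINE | Pattern.DOTALL);

    // Returns the text between start(1) and end() for every match,
    // i.e. the same span the question stores as an int[] pair.
    static List<String> findLinks(String message) {
        List<String> links = new ArrayList<>();
        Matcher matcher = URLPATTERN.matcher(message);
        while (matcher.find()) {
            links.add(message.substring(matcher.start(1), matcher.end()));
        }
        return links;
    }

    public static void main(String[] args) {
        // \u00A7 is the section sign used by the color code.
        String message = "Please use Google: \u00A7ehttps://google.com, and don't ask me.";
        System.out.println(findLinks(message)); // prints [ehttps://google.com]
    }
}
```

The § itself is not part of the character class, so the match starts at the e that follows it.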

How can I change the regular expression above so that it skips the following pattern but still matches the link that follows immediately after the color code?

[&§]{1}[a-z0-9]{1}
Hardik Yewale
Nicola Uetz
  • Does this answer your question? [What is the best regular expression to check if a string is a valid URL?](https://stackoverflow.com/questions/161738/what-is-the-best-regular-expression-to-check-if-a-string-is-a-valid-url) – markspace Oct 11 '20 at 16:22
  • @markspace Of course it does not, it is not a URL validation problem. – Wiktor Stribiżew Oct 11 '20 at 16:27

1 Answer

You can add a (?:[&§][a-z0-9])? pattern (matching an optional sequence of a & or § and then an ASCII letter or digit) at the beginning of your regex:

Pattern URLPATTERN = Pattern.compile(
    "(?:[&§][a-z0-9])?([--:\\w?@%&+~#=]*\\.[a-z]{2,4}/{0,2})((?:[?&]\\w+=\\w+)+|[--:\\w?@%&+~#=]+)?", Pattern.CASE_INSENSITIVE);

See the regex demo.

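When the regex finds §ehttps://google.com, the §e is consumed by the optional non-capturing group (?:[&§][a-z0-9])?, which is why it is "excluded" from the Group 1 value. The overall match still begins at §, so keep using matcher.start(1), as your code already does, rather than matcher.start().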

There is no need to use Pattern.MULTILINE | Pattern.DOTALL with your regex: MULTILINE only affects ^ and $, DOTALL only affects ., and the pattern contains none of them.
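To check the suggestion end to end, here is a minimal sketch that plugs the modified pattern into the loop from the question. The class name ColorCodeLinks, the helper findLinks, and the example.org test message are assumptions made for illustration; § is written as \u00A7 to avoid depending on the source file encoding.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical wrapper class, used only to make the snippet runnable.
class ColorCodeLinks {
    // Pattern from this answer: an optional color code, then the original groups.
    static final Pattern URLPATTERN = Pattern.compile(
        "(?:[&\u00A7][a-z0-9])?([--:\\w?@%&+~#=]*\\.[a-z]{2,4}/{0,2})((?:[?&]\\w+=\\w+)+|[--:\\w?@%&+~#=]+)?",
        Pattern.CASE_INSENSITIVE);

    // Collects each link as the text from the start of Group 1 to the end of the match,
    // so a leading color code consumed by the optional group is left out.
    static List<String> findLinks(String message) {
        List<String> links = new ArrayList<>();
        Matcher matcher = URLPATTERN.matcher(message);
        while (matcher.find()) {
            links.add(message.substring(matcher.start(1), matcher.end()));
        }
        return links;
    }

    public static void main(String[] args) {
        String message = "Please use Google: \u00A7ehttps://google.com, and don't ask me.";
        System.out.println(findLinks(message)); // prints [https://google.com]
    }
}
```

One limitation of this sketch: since & is itself part of the URL character class, an &-style color code is only skipped when the match begins at it. In a string such as foo&ehttps://google.com, the match starts at foo and the whole run is reported as the link.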

Wiktor Stribiżew