
Can you help me with regex?

I have this line:

"Sites www.google.com и www.ridd.rdd..com good."

After parsing, I need to get a line like this:

"Sites http://www.google.com и www.ridd.rdd..com good."

The problem is checking for consecutive dots. For sites with an error (two dots in a row), "http://" should not be prepended.

My regex:

Matcher matchr = Pattern.compile("w{3}(\\.\\w+)+[a-z]{2,6}").matcher(text);

while (matchr.find()) {
    text = text.replace(matchr.group(0), "http://" + matchr.group(0));
}

System.out.println(text);
Wiktor Stribiżew
Yahor Urbanovich

1 Answer


Your regex w{3}(\\.\\w+)+[a-z]{2,6} matches part of the second, malformed "URL", www.ridd.rdd..com (namely www.ridd.rdd). So you need to make sure the substring you match contains no consecutive dots. You can use word boundaries and a negative lookahead, (?!\S*\.{2}).
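
To see what the original pattern picks up, here is a minimal sketch that lists every match. The class name OriginalPatternDemo and the helper findAll are illustrative only and do not come from the question.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class OriginalPatternDemo {
    // The pattern from the question, unchanged.
    static final Pattern ORIGINAL = Pattern.compile("w{3}(\\.\\w+)+[a-z]{2,6}");

    // Hypothetical helper: collects every substring the pattern matches.
    static List<String> findAll(String text) {
        List<String> matches = new ArrayList<>();
        Matcher m = ORIGINAL.matcher(text);
        while (m.find()) {
            matches.add(m.group());
        }
        return matches;
    }

    public static void main(String[] args) {
        // Sample line from the question; the Cyrillic letter is written
        // as a Unicode escape so the source file stays ASCII-only.
        String text = "Sites www.google.com \u0438 www.ridd.rdd..com good.";
        System.out.println(findAll(text)); // [www.google.com, www.ridd.rdd]
    }
}
```

The second match stops right before the double dot, which is why "http://" still ends up in front of the malformed address.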

Use

String text = "Sites www.google.com и www.ridd.rdd..com good.";
text = text.replaceAll("\\b(?!\\S*\\.{2})w{3}(\\.\\w+)+[a-z]{2,6}\\b", "http://$0");
// => Sites http://www.google.com и www.ridd.rdd..com good.

See the IDEONE demo

Pattern explanation:

  • \\b - leading word boundary
  • (?!\\S*\\.{2}) - there should not be any consecutive dots in the non-whitespace chunk to follow
  • w{3} - match www
  • (\\.\\w+)+ - 1+ sequences of . followed by 1+ alphanumeric or underscore characters
  • [a-z]{2,6} - make sure there are 2 to 6 a-z letters...
  • \\b - at the end of this "word"
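
The pieces above can be put together as a small reusable helper. This is only a sketch: the class name UrlPrefixer and the method addScheme are made up for illustration, and the pattern is precompiled once rather than passed to String.replaceAll on every call.

```java
import java.util.regex.Pattern;

public class UrlPrefixer {
    // Pattern from the answer: the lookahead rejects any non-whitespace
    // chunk that contains two dots in a row.
    private static final Pattern WWW_HOST =
            Pattern.compile("\\b(?!\\S*\\.{2})w{3}(\\.\\w+)+[a-z]{2,6}\\b");

    // Hypothetical helper: prepends "http://" to each match.
    // In the replacement string, $0 stands for the whole matched text.
    static String addScheme(String text) {
        return WWW_HOST.matcher(text).replaceAll("http://$0");
    }

    public static void main(String[] args) {
        // Sample line from the question; the Cyrillic letter is written
        // as a Unicode escape so the source file stays ASCII-only.
        String text = "Sites www.google.com \u0438 www.ridd.rdd..com good.";
        System.out.println(addScheme(text));
    }
}
```

Because the lookahead sits before w{3}, the whole chunk is rejected up front, so the regex engine cannot backtrack into a shorter partial match such as www.ridd.rdd.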
Wiktor Stribiżew