
I have a regular expression that matches words separated by . as potential URLs, but skips those preceded by @, since those are assumed to be email addresses.

This is the regex that I have:

(?:\@(https?:\/\/)?(\w+(\-*\w+)*\.)[a-zA-Z\.]+[\w+\/?\#?\??\=\%\&\-]+.*?)*\K(https?:\/\/)?(\w+(\-*\w+)*\.)[a-zA-Z\.]+[\w+\/?\#?\??\=\%\&\-]+

This does not work correctly for the last email address in the string.

For example, for the string

twitter.com facebook.com kamur@test.com ksou@uni.edu vimal@gsomething.com balaji@sweets.com john wayne <johnwayne@dc.com> 20,000.00

I expect the matches to be twitter.com and facebook.com.

But it also matches dc.com.

dexteran
1 Answer


In your (?:\@(https?:\/\/), the ? in https?: makes the preceding s optional, so it matches either http or https. The ? literally means 0 or 1 of the character s. The : in https?: matches a literal :, nothing special.

The difference is that when ?: comes directly after an unescaped opening parenthesis, it marks a non-capturing group.

Escaped: \(?: is not a non-capturing group
Not escaped: (?: is a non-capturing group
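
As a rough illustration of the points above, here is a minimal Python sketch. It uses simplified patterns and sample strings invented for the example, not the regex from the question (which relies on \K, something Python's built-in re module does not support).

```python
import re

# "s?" makes the "s" optional, so both schemes match; ":" is a literal colon.
scheme = re.compile(r"https?:")
scheme_hits = scheme.findall("http: https: htp:")
print(scheme_hits)  # ['http:', 'https:']

# Unescaped "(?:" opens a non-capturing group: it groups but stores nothing.
non_capturing = re.compile(r"(?:https?://)?(\w+)\.com")
nc_groups = non_capturing.search("https://twitter.com").groups()
print(nc_groups)  # ('twitter',)

# A plain "(" opens a capturing group, so the scheme is stored as group 1.
capturing = re.compile(r"(https?://)?(\w+)\.com")
c_groups = capturing.search("https://twitter.com").groups()
print(c_groups)  # ('https://', 'twitter')

# Escaped "\(?:" is an optional literal "(" followed by a literal ":".
escaped = re.compile(r"\(?:")
escaped_hits = escaped.findall("(: and :")
print(escaped_hits)  # ['(:', ':']
```

The only difference between the second and third patterns is the ?: after the opening parenthesis, which is what decides whether the scheme shows up in the captured groups.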


The next portion of your question: what does the .*? in [\w+\/?\#?\??\=\%\&\-]+.*? refer to?

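  • . will match any character (except a line break, by default)
  • * is a quantifier that will match the preceding . (any character) 0 to unlimited times
  • *? makes * non-greedy (lazy): it matches as few characters as possible while still letting the rest of the pattern match. An internet search will turn up plenty of material on non-greedy matching if you are unfamiliar with it.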
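
To make the greedy versus non-greedy distinction concrete, here is a small Python sketch. The sample string and both patterns are made up for illustration and are much simpler than the regex in the question.

```python
import re

# Hypothetical sample text containing two email-like tokens.
text = "a@test.com b@dc.com"

# Greedy: ".*" grabs as much as it can, then backtracks to the LAST ".com".
greedy = re.findall(r"@.*\.com", text)
print(greedy)  # ['@test.com b@dc.com']

# Non-greedy: ".*?" grabs as little as it can, stopping at the FIRST ".com".
lazy = re.findall(r"@.*?\.com", text)
print(lazy)  # ['@test.com', '@dc.com']
```

The greedy version swallows both addresses in a single match, while the non-greedy version stops at the earliest point where the rest of the pattern can succeed.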
K.Dᴀᴠɪs