
Currently using this:

MatchCollection urlRegExp = Regex.Matches(text, @"https?://(?:www\.)?[a-zA-Z0-9/;&=:_$+!*'()|~\[\]#%.\\-]+");

to replace URLs in a text, but it doesn't catch every URL, and I can't seem to find a good regex for this. Can anyone help?

user3352374
    possible duplicate of [What is the best regular expression to check if a string is a valid URL?](http://stackoverflow.com/questions/161738/what-is-the-best-regular-expression-to-check-if-a-string-is-a-valid-url) – PM 77-1 Sep 02 '14 at 01:46
  • which platform are you using? – Steve Sep 02 '14 at 01:47
  • @PM77-1: They aren't trying to check whether a URL is valid, only to find URLs in a text; those are two totally different tasks. – Casimir et Hippolyte Sep 02 '14 at 01:54
  • @PM you are right, this is the same task, I believe. What is the difference? If a substring matches a URL, then the job is half done; you just need to substitute the replacement character. – Arun A K Sep 02 '14 at 01:58

1 Answer


If you need to find URLs in a text, you don't need to conform to the RFC (whatever its number); that is totally useless (and nearly impossible anyway: a pattern that follows the standard would be too slow and too complex).

All URLs in the text should be considered valid (and/or should have been validated, or not, before being inserted into the text by whoever produced it. In other words, that is not your job!).

So, you must find another approach. To do that, ask the right question: how do you distinguish a URL from the surrounding text?

Let's list the common criteria:

  • a URL may begin with the protocol: http, https, ftp, sftp, ftps, gopher, ...
  • a URL may begin with www.
  • a URL does not contain whitespace characters
  • a URL always begins at a word boundary
  • a URL ends before a whitespace character, at the end of the string, or before a punctuation character other than the question mark (since `?` can be present even when there are no GET parameters)

With these requirements, you can easily build a naive pattern for the http protocol:

\b(https?://|www\.)\S+(?=\s|[^\P{P}?]|\z)
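In C# (the language of the question's snippet), this pattern can be used as-is with `Regex.Matches`; here is a minimal sketch, where the sample text is my own illustration:

```csharp
using System;
using System.Text.RegularExpressions;

class UrlFinder
{
    // The naive pattern above, verbatim; a C# verbatim string (@"...")
    // avoids double-escaping the backslashes.
    const string Pattern = @"\b(https?://|www\.)\S+(?=\s|[^\P{P}?]|\z)";

    static void Main()
    {
        // Sample text is my own illustration.
        string text = "Docs at http://example.com/docs and www.example.org here.";

        foreach (Match m in Regex.Matches(text, Pattern))
            Console.WriteLine(m.Value);
        // Prints:
        //   http://example.com/docs
        //   www.example.org
    }
}
```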

Note that once you obtain a result, you are free to check the validity of the URL with a built-in function (which generally doesn't handle all the cases, but now you know why :).
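In C#, one such built-in check is `Uri.TryCreate`. A minimal sketch (the candidate strings are my own examples): note that matches produced by the `www.` alternative carry no scheme, so they need one prepended before they pass absolute-URI validation:

```csharp
using System;

class PostValidation
{
    static void Main()
    {
        // Candidate strings as a regex scan might return them (my own examples).
        string[] candidates = { "http://example.com/page", "www.example.org" };

        foreach (string candidate in candidates)
        {
            // Matches produced by the "www." alternative have no scheme, so
            // Uri.TryCreate with UriKind.Absolute would reject them; prepend
            // a scheme first.
            string normalized = candidate.StartsWith("www.")
                ? "http://" + candidate
                : candidate;

            bool ok = Uri.TryCreate(normalized, UriKind.Absolute, out _);
            Console.WriteLine($"{candidate} -> valid: {ok}");
        }
    }
}
```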

Casimir et Hippolyte
  • Yeah, but will it work with URLs I'm not aware of, like if the URL is something like testone123.me or similar? – user3352374 Sep 06 '14 at 20:23
  • @user3352374: Probably, but there is no way to know whether "testone123.me" is a domain name (and by extension a URL) or a simple text element. If you build, for example, a pattern like `\b\w+\.\w+` to detect that, every substring that looks like this would be matched, and you would obtain false positives (imagine someone who forgets a space after a dot: `Bob has got a cat.me, I have a dog.`). – Casimir et Hippolyte Sep 06 '14 at 20:33
  • Yeah, but would the code you just gave me detect a Facebook URL or anything like that? What about test123.me/fsifi8e3393? :O – user3352374 Sep 06 '14 at 20:39
  • @user3352374: In its current state, obviously not. Remember that my approach is voluntarily naive and will only detect substrings that begin with `http://` or `www` *(in other words, it detects the intention to write a URL, nothing more)*. If you want to detect URLs without `http://` or `www` but with well-known domains, it is also possible to include them in the pattern with an alternation: `\b(https?://|www\.|domain1\.com|domain2\.com)\S*(?=\s|[^\P{P}?]|\z)`. But searching for all possible URLs hidden in a text by syntax analysis alone is, in my opinion, a waste of time. – Casimir et Hippolyte Sep 06 '14 at 22:20
  • In particular, I am thinking of forums in the '90s that tried to forbid users from writing URLs in a post. The pattern or string search to detect URLs was based on the `http://` substring; immediately afterwards, people who wanted to write a URL in a post wrote `h**p://` to evade detection. – Casimir et Hippolyte Sep 06 '14 at 22:26
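To illustrate the alternation idea from the comment above, here is a hedged C# sketch; `facebook.com` and `twitter.com` are illustrative stand-ins for the `domain1.com` / `domain2.com` placeholders, and the sample text is my own:

```csharp
using System;
using System.Text.RegularExpressions;

class KnownDomainFinder
{
    // Extended pattern from the comment above; facebook.com and twitter.com
    // stand in for the domain1.com / domain2.com placeholders.
    const string Pattern =
        @"\b(https?://|www\.|facebook\.com|twitter\.com)\S*(?=\s|[^\P{P}?]|\z)";

    static void Main()
    {
        // Sample text is my own illustration.
        string text = "See facebook.com/somepage or http://example.com now";

        foreach (Match m in Regex.Matches(text, Pattern))
            Console.WriteLine(m.Value);
        // Prints:
        //   facebook.com/somepage
        //   http://example.com
    }
}
```

Note that the alternation only detects the listed domains; anything not in the list (like `test123.me`) still goes unmatched, which is exactly the trade-off the answer describes.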