
I need a regex that does the following:

Extract all strings that start with http://
Extract all strings that start with www.

So I need to extract both kinds.

For example, given the following text:

house home go www.monstermmorpg.com nice hospital http://www.monstermmorpg.com this is incorrect url http://www.monstermmorpg.commerged continue

From the string above, I should get:

    www.monstermmorpg.com
    http://www.monstermmorpg.com
    http://www.monstermmorpg.commerged

Looking for regex or another way. Thank you.

C# 4.0

Furkan Gözükara
  • Recently, bots have started sending URLs to my game players. I will disallow this :) Though I need to allow internal links. – Furkan Gözükara May 14 '12 at 01:53
  • Perhaps you should consider NOT using regex as it's an awkward approach to parsing HTML... http://stackoverflow.com/questions/590747/using-regular-expressions-to-parse-html-why-not – Tom May 05 '14 at 10:13

3 Answers


You can write some pretty simple regular expressions to handle this, or go via more traditional string splitting + LINQ methodology.

Regex

// Requires: using System.Text.RegularExpressions;
var linkParser = new Regex(@"\b(?:https?://|www\.)\S+\b", RegexOptions.Compiled | RegexOptions.IgnoreCase);
var rawString = "house home go www.monstermmorpg.com nice hospital http://www.monstermmorpg.com this is incorrect url http://www.monstermmorpg.commerged continue";
// Shows each of the three links found in rawString, one at a time.
foreach (Match m in linkParser.Matches(rawString))
    MessageBox.Show(m.Value);

Pattern explanation:

\b         - matches a word boundary: the zero-width position between a word character and a non-word character (or the start/end of the string)
(?:        - begins a group; the ?: means the group is not captured
https?://  - matches http:// or https:// (the '?' after the "s" makes it optional)
|          - OR
www\.      - matches the literal string www. (the \. means a literal ".")
)          - ends the group
\S+        - matches one or more non-whitespace characters
\b         - matches the closing word boundary, so the match ends on a word character and trailing punctuation such as "." or "/" is left out

In short, the pattern looks for strings that start with http://, https:// or www. (the (?:https?://|www\.) part) and then matches every character up to the next whitespace, backing off to the last word character because of the closing \b.
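To make the effect of the closing \b concrete, here is a minimal console sketch using the same pattern. The class name `LinkExtractor`, the helper `Extract`, and the example.com input are made up for illustration and are not part of the original answer.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class LinkExtractor
{
    // Same pattern as in the answer.
    private static readonly Regex LinkParser = new Regex(
        @"\b(?:https?://|www\.)\S+\b",
        RegexOptions.Compiled | RegexOptions.IgnoreCase);

    // Hypothetical helper: returns every match as a string.
    public static List<string> Extract(string text)
    {
        var links = new List<string>();
        foreach (Match m in LinkParser.Matches(text))
            links.Add(m.Value);
        return links;
    }

    public static void Main()
    {
        // The closing \b drops the final '.' and '/' from these matches.
        foreach (string link in Extract("See www.example.com. Or https://example.com/ instead"))
            Console.WriteLine(link);
        // prints:
        // www.example.com
        // https://example.com
    }
}
```

Dropping the sentence-ending period is usually what you want; losing a trailing slash is the price, so treat the result as "the link as typed, minus trailing punctuation" rather than a normalized URL.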

Traditional String Options

// Requires: using System.Linq;
var rawString = "house home go www.monstermmorpg.com nice hospital http://www.monstermmorpg.com this is incorrect url http://www.monstermmorpg.commerged continue";
// Split on whitespace, then keep only the tokens that start with one of the link prefixes.
var links = rawString.Split("\t\n ".ToCharArray(), StringSplitOptions.RemoveEmptyEntries).Where(s => s.StartsWith("http://") || s.StartsWith("www.") || s.StartsWith("https://"));
foreach (string s in links)
    MessageBox.Show(s);
Jason Larke
  • The regex in the answer does not work if you want to parse a part of HTML string. Use the following one instead: `@"http(s)?://([\w-]+\.)+[\w-]+(/[\w- ./?%&=]*)?"` – Nikita R. Apr 22 '14 at 21:21
  • The regex `@"\b(?:https?://|www\.)[^ \f\n\r\t\v\]]+\b"` works a little better (in my case anyway) as if the URL is enclosed in BB tags it will include `]` as part of the URL. – Tom Gullen Sep 16 '15 at 14:04
  • @TomGullen Fair point. However, square brackets are actually valid URL characters (according to the RFC spec) so I'll leave the answer as-is as it's just for the most general case. – Jason Larke Sep 17 '15 at 04:42
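
The difference discussed in the comments above can be seen with a small sketch. The BB-code input string, the class name `BbCodeDemo`, and the helper `FirstMatch` are assumptions for illustration; the two patterns are the one from the answer and the one from Tom Gullen's comment.

```csharp
using System;
using System.Text.RegularExpressions;

public static class BbCodeDemo
{
    // Pattern from the answer (general case).
    public const string General = @"\b(?:https?://|www\.)\S+\b";
    // Variant from the comment: also stops at a closing square bracket.
    public const string NoBracket = @"\b(?:https?://|www\.)[^ \f\n\r\t\v\]]+\b";

    // Hypothetical helper: value of the first match, or "" when nothing matches.
    public static string FirstMatch(string pattern, string text)
    {
        return Regex.Match(text, pattern, RegexOptions.IgnoreCase).Value;
    }

    public static void Main()
    {
        string text = "[url=http://example.com]my site[/url]";
        Console.WriteLine(FirstMatch(General, text));   // http://example.com]my
        Console.WriteLine(FirstMatch(NoBracket, text)); // http://example.com
    }
}
```

Which one fits depends on the input: the general pattern keeps brackets that may legitimately belong to a URL, while the variant is likely the better choice when the text is known to contain BB tags.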

Using the regex from Nikita's comment, I can get the URL out of a string very easily:

using System.Text.RegularExpressions;

string myString = "test =) https://google.com/";

Match url = Regex.Match(myString, @"http(s)?://([\w-]+\.)+[\w-]+(/[\w- ./?%&=]*)?");

string finalUrl = url.ToString();
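
Note that `Regex.Match` returns only the first match, and its value is an empty string when nothing matches. A minimal sketch that makes the "no URL" case explicit could look like this; the class name `FirstUrlDemo` and the helper `FirstUrlOrNull` are hypothetical, and `Regex.Matches` would be the call to use if every URL is needed.

```csharp
using System;
using System.Text.RegularExpressions;

public static class FirstUrlDemo
{
    // Pattern from Nikita R.'s comment.
    private const string UrlPattern = @"http(s)?://([\w-]+\.)+[\w-]+(/[\w- ./?%&=]*)?";

    // Hypothetical helper: the first URL in the text, or null when there is none.
    public static string FirstUrlOrNull(string text)
    {
        Match m = Regex.Match(text, UrlPattern);
        return m.Success ? m.Value : null;
    }

    public static void Main()
    {
        Console.WriteLine(FirstUrlOrNull("test =) https://google.com/")); // https://google.com/
        Console.WriteLine(FirstUrlOrNull("no links here") ?? "(none)");   // (none)
    }
}
```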
Diego Montania

The approaches above do not work on HTML that contains a URL, for example:

<table><tr><td class="sub-img car-sm" rowspan ="1"><img src="https://{s3bucket}/abc/xyzxyzxyz/subject/jkljlk757cc617-a560-48f5-bea1-f7c066a24350_202008210836495252.jpg?X-Amz-Expires=1800&X-Amz-Algorithm=abcabcabc&X-Amz-Credential=AKIAVCAFR2PUOE4WV6ZX/20210107/ap-south-1/s3/aws4_request&X-Amz-Date=20210107T134049Z&X-Amz-SignedHeaders=host&X-Amz-Signature=3cc6301wrwersdf25fb13sdfcfe8c26d88ca1949e77d9e1d9af4bba126aa5fa91a308f7883e"></td><td class="icon"></td></tr></table>

For that, use the regular expression below:

Regex regx = new Regex("https?://([\\w+?\\.\\w+])+([a-zA-Z0-9\\~\\!\\@\\#\\$\\%\\^\\&\\*\\(\\)_\\-\\=\\+\\\\\\/\\?\\.\\:\\;\\'\\,]*)?", RegexOptions.IgnoreCase);
Rohil Patel