Assume my string is

http://www.test.com\r\nhttp://www.hello.com<some text here>http://www.world.com

I want to extract all URLs in the string. The output should be as follows:

http://www.test.com
http://www.hello.com
http://www.world.com

How can I achieve that?

There is no html tag in the string so extracting them using HTMLAgilityPack is not a viable option.

Ωmega
user1295450
  • Is http the only protocol you want to extract? What about https or ftp? – Mark Byers Jul 28 '12 at 22:02
  • I have posted an answer that works with the protocol prefixes `http://` and `https://`, and **also without a protocol prefix** if `www.` is present. I believe that is what you need, as links with `www.` are not always listed with a protocol... Good luck! – Ωmega Jul 28 '12 at 23:18

3 Answers

4

Of the approaches suggested in the other answers and comments, the easiest one to actually implement is splitting on the prefix. There is a lot of guesswork involved here, and this is probably one of the better bets for catching all the URLs:

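using System.Collections.Generic;
using System.Text.RegularExpressions;

public static List<string> ParseUrls(string input) {
    List<string> urls = new List<string>();
    const string pattern = "http://"; // use a broader expression here to also cover https, ftp and so on
    string[] m = Regex.Split(input, pattern);
    // Start at 1 rather than filtering with i % 2 == 0: without capture groups, Regex.Split
    // returns only the text between matches, so every element after m[0] follows a prefix.
    for (int i = 1; i < m.Length; i++) {
        // Keep the leading run of URL characters; stop at the first other character (\r, <, space, ...)
        Match urlMatch = Regex.Match(m[i], "^(?<url>[a-zA-Z0-9/?=&.]+)", RegexOptions.Singleline);
        if (urlMatch.Success)
            urls.Add(string.Format("http://{0}", urlMatch.Groups["url"].Value)); // change the prefix to match the chosen pattern
    }
    return urls;
}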
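The comment in the code above suggests a broader expression to cover ftp and other schemes. As a rough sketch of that idea (not part of the original answer), the URLs can also be matched directly instead of splitting. The class name `UrlExtractor`, the scheme list, and the extended character class are assumptions made for illustration:

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class UrlExtractor
{
    // Assumed pattern: http, https or ftp, then "://", then the character
    // class from the answer above extended with '-', '_', '%' and '#'.
    private static readonly Regex UrlPattern =
        new Regex(@"(?:https?|ftp)://[a-zA-Z0-9/?=&.\-_%#]+", RegexOptions.IgnoreCase);

    public static List<string> ParseUrls(string input)
    {
        var urls = new List<string>();
        // Each match is a complete URL, so the prefix does not need to be re-added.
        foreach (Match match in UrlPattern.Matches(input))
            urls.Add(match.Value);
        return urls;
    }
}
```

Because '.' is in the character class, a URL at the end of a sentence would keep the trailing period, so you may need to trim punctuation afterwards.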
Diego D
  • Oh well, we have a hero here... who loves downvoting without commenting on the reason. May I ask why you think this is not going to work? – Diego D Jul 28 '12 at 22:36
  • Yeah, I see nothing wrong with what you posted... someone has gone through and just down-voted a bunch of our posts for no reason, ha. – Vaughan Hilts Jul 28 '12 at 22:42
  • Yes, maybe it was not the best solution given... and honestly it was inspired by previously given advice (as I clearly stated in my answer), but it was unfair to downvote with no reason. Sometimes it's really demotivating, because you spend time answering and you just get penalized. Too bad. – Diego D Jul 28 '12 at 22:48
  • I gave you an upvote to at least offset the troll that is running around here... (and considering your solution answers the question and provides no BAD advice) – Vaughan Hilts Jul 28 '12 at 22:50
  • I'm glad you accepted my answer, but please make sure it actually works in your very specific scenario. I can see there are some holes in the logic, so if you have further issues just point them out. – Diego D Jul 29 '12 at 11:01
  • Test it: adasdadasdada url('http:// aaaaa.com/bbb.eot') asdasd accvxcvxcv url('https:// aaaaa.com/ccc.eot') 345345345345 –  Feb 22 '15 at 16:42
  • I don't understand the "if (i % 2 == 0)" – Florian Jan 08 '16 at 21:26
0

A ":" is a reserved character in a URL, and "://" normally appears only right after the scheme, so you can assume that each match for "http://" is a good, valid start of a URL.

Search for this and find your start.

You could construct a list of known good TLDs you may encounter (this will help: http://en.wikipedia.org/wiki/List_of_Internet_top-level_domains)

The end of the first known TLD after that start is your ending point, so search for the TLDs from the start index rather than from the beginning of the string.

Take everything from the start index up to that ending point as the URL, and skip whatever follows until the next "http://"; it's no good.

I'm assuming your URLs have no paths or sub-directories, since you haven't listed any.
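The steps above could be sketched roughly as follows. This is only an illustration under stated assumptions: the class name `TldUrlScanner` and the three-entry TLD list are hypothetical, and a real list would have to be built from the full TLD registry.

```csharp
using System;
using System.Collections.Generic;

public static class TldUrlScanner
{
    // Hypothetical, deliberately tiny list of known TLDs (an assumption for this sketch).
    private static readonly string[] KnownTlds = { ".com", ".org", ".net" };

    public static List<string> Extract(string input)
    {
        const string prefix = "http://";
        var urls = new List<string>();
        int start = input.IndexOf(prefix, StringComparison.Ordinal);
        while (start >= 0)
        {
            int hostStart = start + prefix.Length;
            // The next prefix (or the end of the input) bounds the search for a TLD.
            int next = input.IndexOf(prefix, hostStart, StringComparison.Ordinal);
            int limit = next >= 0 ? next : input.Length;

            // Find the known TLD that ends earliest inside [hostStart, limit).
            int end = -1;
            foreach (string tld in KnownTlds)
            {
                int at = input.IndexOf(tld, hostStart, limit - hostStart, StringComparison.Ordinal);
                if (at >= 0 && (end < 0 || at + tld.Length < end))
                    end = at + tld.Length;
            }

            // Keep prefix + host up to the end of the TLD; everything after it is skipped.
            if (end > 0)
                urls.Add(input.Substring(start, end - start));
            start = next;
        }
        return urls;
    }
}
```

A plain substring search like this is easy to fool: a host such as `www.community.org` contains ".com", so a real implementation would also need to check what follows the TLD.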

Vaughan Hilts
0

You could use the string splitting logic from this question, splitting the input on "http://". If you do need the "http://" part, you can always add it back afterwards.

Edit: Note that you would then have to search for and filter out things like \r\n at the end of each URL, but that should not be a big problem...
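A minimal sketch of this split-and-filter idea, assuming the only things that can follow a URL are whitespace, "<", or a literal backslash (as in the sample string). The class name `SplitUrlExtractor` and the terminator list are assumptions for illustration, not taken from the linked question:

```csharp
using System;
using System.Collections.Generic;

public static class SplitUrlExtractor
{
    public static List<string> Extract(string input)
    {
        const string prefix = "http://";
        // Assumed terminators: whitespace, '<', and a backslash (for a literal "\r\n").
        char[] terminators = { '\r', '\n', ' ', '\t', '<', '\\' };
        var urls = new List<string>();

        string[] parts = input.Split(new[] { prefix }, StringSplitOptions.None);
        // parts[0] is the text before the first prefix, so it is skipped.
        for (int i = 1; i < parts.Length; i++)
        {
            // Cut the piece at the first terminator to drop the trailing junk.
            int end = parts[i].IndexOfAny(terminators);
            string rest = end >= 0 ? parts[i].Substring(0, end) : parts[i];
            if (rest.Length > 0)
                urls.Add(prefix + rest); // add the prefix back
        }
        return urls;
    }
}
```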

Community
Kjartan