1

Consider:

<div><a href="http://anirudhagupta.blogspot.com/">Anirudha Web blog</a></div>

What regular expression will extract http://anirudhagupta.blogspot.com/ from the markup above?

A solution in C# would be good. I would also be happy to do this with jQuery.

DaveRandom

5 Answers

1

If you want to use jQuery, you can do the following. Note that .attr('href') returns the value for the first matched <a> element only.

$('a').attr('href')
Peter Stuifzand
0

Quick and dirty:

href="(.*?)"
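
As a rough illustration of the quick-and-dirty pattern, here is a minimal Python sketch applied to the snippet from the question. The variable names are mine, not part of the answer. The lazy `.*?` stops at the first closing double quote, so only the URL is captured.

```python
import re

# Sample markup taken from the question.
html = '<div><a href="http://anirudhagupta.blogspot.com/">Anirudha Web blog</a></div>'

# Lazy quantifier: capture as little as possible up to the next double quote.
HREF_LAZY = re.compile(r'href="(.*?)"')

urls = HREF_LAZY.findall(html)
print(urls)
```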

OK, let's go with another regex, this one for parsing URLs. It comes from RFC 2396 (URI Generic Syntax), Appendix B: "Parsing a URI Reference with a Regular Expression".

^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?
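
Note that this expression splits a URI you already have into its components; it does not locate URLs inside HTML. The sketch below, in Python, shows the group numbering that RFC 2396 describes (2 = scheme, 4 = authority, 5 = path, 7 = query, 9 = fragment). The helper name `split_uri` is hypothetical.

```python
import re

# Regex from RFC 2396, Appendix B, as quoted above.
URI_RE = re.compile(r'^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?')

def split_uri(uri):
    """Return the five URI components; a component that is absent is None."""
    m = URI_RE.match(uri)  # every part is optional, so this always matches
    return {
        "scheme": m.group(2),
        "authority": m.group(4),
        "path": m.group(5),
        "query": m.group(7),
        "fragment": m.group(9),
    }

parts = split_uri("http://anirudhagupta.blogspot.com/")
print(parts)
```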

Of course, your HTML may also contain relative URLs, and you will need to handle those differently; for that I recommend the C# Uri constructor Uri(Uri, String).
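
The answer points at C# for this step. As a rough analog only, Python's `urllib.parse.urljoin` resolves a relative reference against a base address in a similar way. The base address and the sample hrefs below are made up for illustration.

```python
from urllib.parse import urljoin

# Hypothetical address of the page the links were found on.
base = "http://anirudhagupta.blogspot.com/2009/11/"

# Hypothetical href values: relative, root-relative, and already absolute.
hrefs = ["post.html", "/about", "http://example.com/"]

resolved = [urljoin(base, href) for href in hrefs]
for url in resolved:
    print(url)
```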

Rubens Farias
0

The simplest way to do this is using the following regular expression.

/href="([^"]+)"/

This captures every character after the opening quote up to, but not including, the next quote. In most regex engines this is a fast way to match a quoted string that cannot itself contain quotes, because the negated character class never needs to backtrack. Quotes inside attribute values should be encoded, so a well-formed value will not contain one.
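
To see why the negated class matters, the following Python sketch compares it with a greedy `.*` on a line that holds two links. The second URL is made-up sample data.

```python
import re

# Two links on one line; the second URL is hypothetical sample data.
line = ('<a href="http://anirudhagupta.blogspot.com/">A</a> '
        '<a href="http://example.com/">B</a>')

# Greedy: runs to the LAST double quote on the line, swallowing markup.
greedy = re.findall(r'href="(.*)"', line)

# Negated class: stops at the first double quote after each href=".
negated = re.findall(r'href="([^"]+)"', line)

print(greedy)
print(negated)
```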

UPDATE: A complete Perl program for parsing URLs would look like this:

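use 5.010;
use strict;
use warnings;

my @matches;
while (<>) {
    push @matches, m/href="([^"]+)"/gi;          # double-quoted values
    push @matches, m/href='([^']+)'/gi;          # single-quoted values
    push @matches, m/href=([^"'\s>][^>\s]*)/gi;  # unquoted values
}
# Print once, after all input is read, so no URL is printed repeatedly.
say for @matches;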

It reads from stdin and prints all URLs. It handles the three possible quoting styles: double quotes, single quotes, and no quotes. Use it with curl to find all the URLs in a web page:

curl url | perl urls.pl
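
For readers who prefer Python, the same three quoting styles can be folded into one alternation. This is a sketch under my own assumptions, not a translation of the Perl above: the name `extract_hrefs` is hypothetical, and unlike the Perl patterns it also tolerates whitespace around `=` and empty quoted values.

```python
import re

# One alternative per quoting style: double-quoted, single-quoted, unquoted.
HREF_RE = re.compile(
    r'''href\s*=\s*(?:"([^"]*)"|'([^']*)'|([^\s"'>]+))''',
    re.IGNORECASE,
)

def extract_hrefs(html):
    """Return href values in document order, whatever quoting style was used."""
    results = []
    for m in HREF_RE.finditer(html):
        # Exactly one of the three groups took part in the match.
        results.append(next(g for g in m.groups() if g is not None))
    return results

sample = '<a href="a.html">x</a><a href=\'b.html\'>y</a><a href=c.html>z</a>'
print(extract_hrefs(sample))
```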
Peter Mortensen
Peter Stuifzand
  • In the wild, HTML can be a deadly thing. href=a.html is "valid," or at least should work just as well as href="a.html" and in most instances special characters that should be escaped aren't *cough* google *cough* – Gary Green Nov 30 '09 at 14:11
  • Correct, there are many pitfalls when using information from the web. On the other hand if I need to find the urls from one webpage on which I can see all possible problems (or find out by testing) I will use this regex (or variant) before using heavier tools. Still, this all depends on the situation and this looks like a Get it Done situation. – Peter Stuifzand Nov 30 '09 at 14:59
  • blah... this won't work at all. Attribute values can be delimited by ", by ', or by nothing at all. – Hogan Nov 30 '09 at 17:49
0

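The right way to do this is to load the HTML into the C# XML parser and then use XPath to query the URLs. This way you don't have to write the parsing logic yourself. Note that an XML parser only accepts well-formed markup, so this works directly only for XHTML-style input.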
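
The answer has C# in mind. As a minimal sketch of the same idea in Python, the standard library's `xml.etree.ElementTree` can parse well-formed markup and query it with a limited XPath subset. It raises a parse error on markup that is not well-formed XML, which much real-world HTML is not.

```python
import xml.etree.ElementTree as ET

# Snippet from the question; it happens to be well-formed XML.
xhtml = '<div><a href="http://anirudhagupta.blogspot.com/">Anirudha Web blog</a></div>'

root = ET.fromstring(xhtml)

# Limited XPath: every <a> descendant that carries an href attribute.
hrefs = [a.get("href") for a in root.findall(".//a[@href]")]
print(hrefs)
```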

Peter Mortensen
Hogan
  • Why write when you can link: http://www.c-sharpcorner.com/UploadFile/shehperu/SimpleXMLParser11292005004801AM/SimpleXMLParser.aspx is a nice simple example. http://developer.yahoo.com/dotnet/howto-xml_cs.html is a more complex one. But as you can see, all you do is read it into the XML object and then query it with XPath. You will then have a list of href attributes. Simple. Done. – Hogan Dec 01 '09 at 20:40
-2

You don't need a complicated regular expression or HTML parser, since you only want to extract links. Here's a generic way to do it.

data="""
<html>
abcd ef ....
blah blah <div><a href="http://anirudhagupta.blogspot.com/">Anirudha Web blog</a></div>
blah  ...
<div><a href="http://mike.blogspot.com/">Mike's Web blog
</a></div>
end...
</html>
"""    
prefix = '<a href="'
for item in data.split("</a>"):
    if prefix in item:
        start = item.index(prefix) + len(prefix)  # index of the first character of the URL
        end = item.index('"', start)              # index of the closing quote
        print(item[start:end])                    # print only the URL

The above is Python code, but you can adapt the idea to C#. Split your HTML string using "</a>" as the delimiter. Go through each piece, check whether it contains <a href=", and then take the substring that follows it, up to the closing quote. Those substrings are your links.

Peter Mortensen
ghostdog74
  • This seems more complex than a regex! – Gary Green Nov 30 '09 at 14:08
  • Complex because it has more words? Would you rather read an essay written in English or one encoded with numbers, each number representing a letter? It's the same analogy. What the regex does behind the scenes is roughly the same as what I posted: string manipulation, except that here it's presented more clearly, without leaving the reader guessing what the code means. – ghostdog74 Nov 30 '09 at 15:00
  • Take for example the regex posted by Rubens. Seriously, if you can decipher what it means at first glance, I take my hat off to you. – ghostdog74 Nov 30 '09 at 15:01
  • Also, here you get to make sure the code is optimized. Who knows what the regex will do? You know the best place to split the string, so this is going to be much better than the regex. – Hogan Nov 30 '09 at 17:48
  • See my comment below; that is the "non complex" way to do it. – Hogan Nov 30 '09 at 17:56