Get URL from HTML code using a regular expression

Question

Consider:

<div><a href="http://anirudhagupta.blogspot.com/">Anirudha Web blog</a></div>

What regular expression extracts http://anirudhagupta.blogspot.com/ from the HTML above?

A suggestion in C# would be good. I would also like a jQuery way to do this.

Don't use regular expressions for processing HTML, it will drive you insane! http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454 — soulmerge, Nov 30 '09 at 12:35
@soulmerge, I agree with you, but it seems he/she just wants to grab URL addresses, not parse HTML code — Rubens Farias, Nov 30 '09 at 12:40
Don't you know that Internet URLs start with http:// or https://? — , Nov 30 '09 at 12:41
Are you trying to extract links from plain text, or was the ` — Josh Lee, Nov 30 '09 at 12:43
@Rubens Farias - The URLs are written in HTML, so the HTML code has to be parsed (and entities decoded, etc). — Quentin, Nov 30 '09 at 13:36

score 1 · Answer 1 · answered Nov 30 '09 at 12:49

If you want to use jQuery, you can do the following (note that .attr('href') returns the value of the first matched <a> element only):

$('a').attr('href')
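
Outside the browser, the same idea (ask a parser for the attribute instead of pattern-matching the markup) can be sketched in Python with the standard library's html.parser module. This is only an illustration: the class name HrefCollector is made up here, and unlike the jQuery call it collects the href of every <a> element rather than a single one.

```python
from html.parser import HTMLParser


class HrefCollector(HTMLParser):
    """Collects the href value of every <a> start tag (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; the parser lowercases names.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value is not None:
                    self.hrefs.append(value)


page = '<div><a href="http://anirudhagupta.blogspot.com/">Anirudha Web blog</a></div>'
collector = HrefCollector()
collector.feed(page)
collector.close()
print(collector.hrefs)
```

Because the parser tokenizes attributes itself, quoting style and attribute order stop being the caller's problem.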

answered Nov 30 '09 at 12:49

Peter Stuifzand

Rubens Farias · Accepted Answer · 2009-11-30T13:50:12.520

0

Quick and dirty:

href="(.*?)"
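
As a rough illustration of why the lazy quantifier matters, here is the pattern tried in Python's re module (the thread asks about C#, so treat this as a sketch of the regex behaviour only). The second input string, with two links on one line, is invented for the comparison.

```python
import re

page = '<div><a href="http://anirudhagupta.blogspot.com/">Anirudha Web blog</a></div>'
two_links = '<a href="a.html">A</a> <a href="b.html">B</a>'  # made-up input

# (.*?) is lazy: each match stops at the first closing quote it can reach.
lazy_single = re.findall(r'href="(.*?)"', page)
lazy_double = re.findall(r'href="(.*?)"', two_links)

# For contrast, a greedy (.*) runs on to the last double quote on the line.
greedy_double = re.findall(r'href="(.*)"', two_links)

print(lazy_single)
print(lazy_double)
print(greedy_double)
```

The greedy variant swallows everything between the first opening quote and the last quote on the line, which is the failure mode the ? guards against.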

OK, here is another regex, this one for parsing URLs. It comes from RFC 2396 (URI Generic Syntax), Appendix B, "Parsing a URI Reference with a Regular Expression":

^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?
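
As I read the RFC appendix, this expression is meant to break an already isolated URI reference into its components; it does not locate URLs inside HTML. A minimal Python sketch of that use follows. The function name split_uri is hypothetical, and the group numbers (2, 4, 5, 7, 9) follow the RFC's description of the pattern.

```python
import re

# The pattern from RFC 2396, Appendix B, unchanged.
URI_RE = re.compile(r'^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?')


def split_uri(uri):
    """Split a URI reference into its five components (hypothetical helper)."""
    # Every part of the pattern is optional, so it matches any string.
    m = URI_RE.match(uri)
    return {
        "scheme": m.group(2),
        "authority": m.group(4),
        "path": m.group(5),
        "query": m.group(7),
        "fragment": m.group(9),
    }


absolute = split_uri("http://anirudhagupta.blogspot.com/")
relative = split_uri("page.html?x=1#top")  # made-up relative reference
print(absolute)
print(relative)
```

Components that are absent come back as None, which is one way to tell a relative reference (no scheme, no authority) from an absolute one.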

Of course, your HTML may contain relative URLs, and you'll need to handle those in another way; I recommend the C# Uri constructor Uri(Uri, String), which combines a base URI with a relative URI string.
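
To show what resolving a relative URL against a base looks like, here is a small Python sketch using urllib.parse.urljoin, which plays a role similar to the C# constructor mentioned above. The base address and the relative links are assumptions made up for the example.

```python
from urllib.parse import urljoin

# Assumed address of the page the links were found on (not from the question).
base = "http://anirudhagupta.blogspot.com/2009/11/index.html"

hrefs = ["about.html", "/feeds/posts", "../archive.html", "http://example.com/"]
resolved = [urljoin(base, href) for href in hrefs]

for href, url in zip(hrefs, resolved):
    print(href, "->", url)
```

Absolute links pass through unchanged, so the same call can be applied to every extracted href without checking its form first.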

edited Nov 30 '09 at 13:50

answered Nov 30 '09 at 12:34

Rubens Farias

No thanks, but by "get programmatically" I mean a regex to get the URL. – Nov 30 '09 at 12:36
@Gupta, I didn't understand your comment; both are regular expressions. – Rubens Farias Nov 30 '09 at 12:46
Nice try, but (a) *? for minimal munch is FWIS rare among the world's regex flavours (b) too many mistakes in your second regex to begin listing them – Stewart Nov 30 '09 at 13:43
@Stewart, how about this one? – Rubens Farias Nov 30 '09 at 13:50
@Rubens Farias, don't worry; I was asking how I can do it with a regex. – Nov 30 '09 at 14:07

score 0 · Answer 3 · edited Sep 09 '11 at 11:21

The simplest way to do this is using the following regular expression.

/href="([^"]+)"/

This captures every character after the opening quote up to, but not including, the next quote. In most languages this is the fastest way to match a quoted string that cannot itself contain quotes. Quotes inside attribute values should be encoded (for example as &quot;), so the closing quote is unambiguous.
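
A short Python sketch of the negated-character-class pattern, assuming Python's re and html modules. It also decodes character references in the captured value, since encoded quotes and ampersands come out of the regex still encoded. The sample markup is invented for the illustration.

```python
import html
import re

HREF_RE = re.compile(r'href="([^"]+)"')

# Made-up input: one link with encoded characters, one with an empty href.
snippet = '<a href="search?q=a&amp;b=&quot;x&quot;">Search</a> <a href="">Empty</a>'

# [^"]+ needs at least one character, so the empty href is skipped.
raw = HREF_RE.findall(snippet)

# Decode &amp;, &quot; and friends to get the URL the browser would use.
decoded = [html.unescape(value) for value in raw]

print(raw)
print(decoded)
```

No lazy quantifier is needed here because the character class itself cannot cross a quote.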

UPDATE: A complete Perl program for parsing URLs would look like this:

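use 5.010;
use strict;
use warnings;

while (<>) {
    my @matches;    # start empty for each line so earlier URLs are not printed again
    push @matches, m/href="([^"]+)"/gi;              # double-quoted values
    push @matches, m/href='([^']+)'/gi;              # single-quoted values
    push @matches, m/href=([^"'][^>\s]*)[>\s]+/gi;   # unquoted values
    say for @matches;
}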

It reads from stdin and prints all URLs. It handles the three quoting styles: double quotes, single quotes, and no quotes. Use it with curl to find all the URLs in a webpage:

curl url | perl urls.pl
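
For comparison, the same three quoting styles can be folded into a single pattern. The following is a Python sketch of that idea, not a port of the Perl program: the function name extract_hrefs and the sample input are assumptions, and it works on a string rather than reading stdin.

```python
import re

# One alternative per quoting style; exactly one group is set for each match.
HREF_RE = re.compile(
    r"""href=(?:"([^"]+)"|'([^']+)'|([^"'>\s][^>\s]*))""",
    re.IGNORECASE,
)


def extract_hrefs(text):
    """Return href values in order of appearance (hypothetical helper)."""
    return [
        next(group for group in m.groups() if group is not None)
        for m in HREF_RE.finditer(text)
    ]


sample = """<a href="http://a.example/">A</a>
<A HREF='b.html'>B</A>
<a href=c.html>C</a>"""
urls = extract_hrefs(sample)
print(urls)
```

Scanning once with alternation keeps the results in document order, whereas three separate passes group them by quoting style.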

edited Sep 09 '11 at 11:21

Peter Mortensen

answered Nov 30 '09 at 12:44

Peter Stuifzand

In the wild, HTML can be a deadly thing. href=a.html is "valid," or at least should work just as well as href="a.html" and in most instances special characters that should be escaped aren't *cough* google *cough* – Gary Green Nov 30 '09 at 14:11
Correct, there are many pitfalls when using information from the web. On the other hand if I need to find the urls from one webpage on which I can see all possible problems (or find out by testing) I will use this regex (or variant) before using heavier tools. Still, this all depends on the situation and this looks like a Get it Done situation. – Peter Stuifzand Nov 30 '09 at 14:59
blah... this won't work at all. Attribute values can be delimited by ", by ', or by nothing at all. – Hogan Nov 30 '09 at 17:49

score 0 · Answer 4 · edited Sep 09 '11 at 11:24

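The right way to do this is to load the HTML into the C# XML parser and then use XPath to query the URLs. This way you don't have to write any parsing code yourself. Note that an XML parser only accepts well-formed markup, so this works for XHTML-style input but not for arbitrary tag soup.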
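
A minimal sketch of the parse-then-query approach, written in Python with xml.etree.ElementTree rather than the C# classes the answer has in mind. ElementTree implements only a subset of XPath, and the malformed second input is invented to show what happens when the markup is not well-formed XML.

```python
import xml.etree.ElementTree as ET

xhtml = '<div><a href="http://anirudhagupta.blogspot.com/">Anirudha Web blog</a></div>'

root = ET.fromstring(xhtml)
# Select every <a> descendant that carries an href attribute.
hrefs = [a.get("href") for a in root.findall(".//a[@href]")]
print(hrefs)

# Unquoted attributes and unclosed tags make the XML parser give up.
try:
    ET.fromstring("<div><a href=a.html>unquoted<br></div>")
    rejected = False
except ET.ParseError as exc:
    rejected = True
    print("not well-formed:", exc)
```

When the input cannot be guaranteed to be well-formed, a lenient HTML parser is the safer front end for the same query step.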

edited Sep 09 '11 at 11:24

Peter Mortensen

answered Nov 30 '09 at 17:56

Hogan

Why write when you can link: http://www.c-sharpcorner.com/UploadFile/shehperu/SimpleXMLParser11292005004801AM/SimpleXMLParser.aspx is a nice simple example. http://developer.yahoo.com/dotnet/howto-xml_cs.html is a more complex one. But as you can see... all you do is read it into the xml object and then query it with xpath. you will then have a list of href attributes. simple. done. – Hogan Dec 01 '09 at 20:40

score -2 · Answer 5 · edited Sep 09 '11 at 11:23

You don't need a complicated regular expression or HTML parser, since you only want to extract links. Here's a generic way to do it.

data = """
<html>
abcd ef ....
blah blah <div><a href="http://anirudhagupta.blogspot.com/">Anirudha Web blog</a></div>
blah  ...
<div><a href="http://mike.blogspot.com/">Mike's Web blog
</a></div>
end...
</html>
"""
prefix = '<a href="'
for item in data.split("</a>"):
    if prefix in item:
        start = item.index(prefix) + len(prefix)  # first character of the URL
        end = item.index('"', start)              # closing quote of the attribute
        print(item[start:end])                    # print only the URL

The above is Python code, but you can adapt the idea to C#. Split your HTML string using "</a>" as the delimiter. Go through each piece, check for "href", then take the quoted value that follows "href". Those values are your links.

edited Sep 09 '11 at 11:23

Peter Mortensen

answered Nov 30 '09 at 13:33

ghostdog74

This seems more complex than a regex! – Gary Green Nov 30 '09 at 14:08
Complex because it has more words? Would you rather read an essay written in English or one encoded with numbers, each number representing a letter? It's the same analogy. What a regex does behind the scenes is roughly the same as what I posted: string manipulation, except that here it's presented more clearly to the reader, who doesn't have to guess what your code means. – ghostdog74 Nov 30 '09 at 15:00
Take for example the regex posted by Rubens. Seriously, if you can decipher what it means at first glance, I take my hat off to you. – ghostdog74 Nov 30 '09 at 15:01
also here you get to make sure the code is optimized. Who knows what the regex will do... you know is the best place to split the code -- going to be much better than the regex. – Hogan Nov 30 '09 at 17:48
See my comment below -- that is the "non complex" way to do it. – Hogan Nov 30 '09 at 17:56
