Extract all values between double quotes from a webpage’s source code

Question

I have saved the source code of a webpage (using the "view source" option available in every browser). Now I want to capture everything between double quotes that starts with http://. How can I do that?

Are you trying to extract some particular attribute values? An HTML parser could be more useful. — Ry-, Apr 28 '13 at 15:14
Actually, the source code of a page may contain links, images, etc. Internal links usually appear without "http", so I'm not interested in them. I want everything that starts with http, whether it's an external link, an image, or anything else. — Junaid Rehman, Apr 28 '13 at 15:57

score 1 · Answer 1 · answered Apr 28 '13 at 16:16

Using HTML Agility Pack

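using System.Linq;
using HtmlAgilityPack;

string path = ...
var doc = new HtmlDocument();
doc.Load(path);

// Collect every attribute value, on any node, that starts with "http://".
var links =
    from e in doc.DocumentNode.Descendants()
    from a in e.Attributes
    where a.Value.StartsWith("http://")
    select a.Value;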

Note that this only returns URLs that appear in HTML attribute values, not URLs in plain text.

score 0 · Accepted Answer · answered Apr 28 '13 at 16:14

Use regex:

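Imports System.Text.RegularExpressions

' html is assumed to hold the page source as a String.
' Capture group 1 is the text between the quotes, starting with http://.
Dim mc As MatchCollection = Regex.Matches(html, """(http://.+?)""", RegexOptions.IgnoreCase)

For Each m As Match In mc
    Console.WriteLine(m.Groups(1).Value)
Next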

Sample output when html = the source code of this page:

http://cdn.sstatic.net/stackoverflow/img/favicon.ico
http://cdn.sstatic.net/stackoverflow/img/apple-touch-icon.png
http://cdn.sstatic.net/js/stub.js?v=181da36f6419
http://cdn.sstatic.net/stackoverflow/all.css?v=0f0c93534e2b
http://stackoverflow.com/questions/16264292/extract-all-values-between-double-quotes-from-a-webpages-source-code
http://www.gravatar.com/avatar/91d33760d2823fa7cf5c95b41a16fada?s=32&d=identicon&r=PG\
http://stackoverflow.com/users/2264365/ajakblackgoat
http://stackexchange.com
http://chat.stackoverflow.com
... etc
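
On how a pattern like this is put together, here is a minimal C# sketch that builds an equivalent pattern piece by piece. The sample HTML string and the class name QuotedUrlDemo are made up for illustration. It also swaps the lazy .+? for the negated character class [^"]+, which is a common alternative and not what the code above uses.

```csharp
using System;
using System.Text.RegularExpressions;

class QuotedUrlDemo
{
    static void Main()
    {
        // Hypothetical sample input; in practice this would be the saved page source.
        string html =
            "<a href=\"http://example.com/a\">x</a> " +
            "<img src=\"/local.png\"> " +
            "<a href=\"HTTP://example.org\">y</a>";

        // The pattern, read left to right:
        //   "          a literal opening double quote
        //   (          start of capture group 1
        //   http://    the literal prefix we are looking for
        //   [^"]+      one or more characters that are not a double quote
        //   )          end of capture group 1
        //   "          a literal closing double quote
        string pattern = "\"(http://[^\"]+)\"";

        // IgnoreCase lets the prefix match HTTP:// as well as http://.
        foreach (Match m in Regex.Matches(html, pattern, RegexOptions.IgnoreCase))
        {
            // Groups[0] is the whole match including the quotes;
            // Groups[1] is only the part inside the parentheses.
            Console.WriteLine(m.Groups[1].Value);
        }
    }
}
```

With this sample input the loop prints http://example.com/a and HTTP://example.org, and skips "/local.png" because it does not start with http://. Both .+? and [^"]+ stop at the first closing quote. The difference is that . does not match a line break by default, while [^"] does, so the two can give different results for a quoted value that spans several lines.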

Thanks, it worked perfectly :D Can you please tell me how to build patterns like this? — Junaid Rehman, Apr 28 '13 at 16:52
