
I have saved the source code of a webpage (using the save option available in every browser). Now I want to extract everything between quotes that starts with http://. How can I do that?

Ry-
Junaid Rehman
  • Use the HTML Agility Pack. – SLaks Apr 28 '13 at 15:14
  • Are you trying to extract some particular attribute values? An HTML parser could be more useful. – Ry- Apr 28 '13 at 15:14
  • Actually, the source code of a page may contain links, images, etc. Internal links usually appear without "http", so I'm not interested in them. I want everything that starts with http, whether it is an external link, an image, or something else. – Junaid Rehman Apr 28 '13 at 15:57

2 Answers


Using HTML Agility Pack

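// Requires: using System.Linq; using HtmlAgilityPack;
string path = ...; // path to the saved HTML file
var doc = new HtmlDocument();
doc.Load(path);
// Collect every attribute value, on any element, that starts with "http://"
var links =
    from e in doc.DocumentNode.Descendants()
    from a in e.Attributes
    where a.Value.StartsWith("http://")
    select a.Value;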

Note that this only returns links that appear in HTML attribute values, not links in plain text.
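Since the attribute query skips URLs that sit outside attributes (for example, quoted strings in inline scripts), a regex pass over the raw text is one possible fallback. The following is a minimal sketch, not a definitive implementation: the class name QuotedUrlExtractor and the method Extract are hypothetical, and it assumes each URL is wrapped in a matching pair of single or double quotes on one line.

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of HTML Agility Pack.
public static class QuotedUrlExtractor
{
    // Group 1 is the opening quote (" or '), group 2 is the URL,
    // and \1 requires the same quote character to close the value.
    private static readonly Regex QuotedHttp = new Regex(
        @"([""'])(http://.+?)\1",
        RegexOptions.IgnoreCase);

    // Returns every quoted value in the raw text that starts with http://
    public static List<string> Extract(string html)
    {
        var urls = new List<string>();
        foreach (Match m in QuotedHttp.Matches(html))
        {
            urls.Add(m.Groups[2].Value);
        }
        return urls;
    }
}
```

Calling QuotedUrlExtractor.Extract(File.ReadAllText(path)) on the saved file would return the quoted http:// values from both markup and script text. Unlike the parser-based query, this does not decode HTML entities, so a value such as &amp; inside a URL comes back as written in the source.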

Thomas Levesque

Use regex:

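' Requires: Imports System.Text.RegularExpressions
' html holds the page source; group 1 captures the URL between the double quotes
Dim mc As MatchCollection = Regex.Matches(html, """(http://.+?)""", RegexOptions.IgnoreCase)

For Each m As Match In mc
    Console.WriteLine(m.Groups(1).Value)
Next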

Sample output when html contains the source code of this page:

http://cdn.sstatic.net/stackoverflow/img/favicon.ico
http://cdn.sstatic.net/stackoverflow/img/apple-touch-icon.png
http://cdn.sstatic.net/js/stub.js?v=181da36f6419
http://cdn.sstatic.net/stackoverflow/all.css?v=0f0c93534e2b
http://stackoverflow.com/questions/16264292/extract-all-values-between-double-quotes-from-a-webpages-source-code
http://www.gravatar.com/avatar/91d33760d2823fa7cf5c95b41a16fada?s=32&d=identicon&r=PG\
http://stackoverflow.com/users/2264365/ajakblackgoat
http://stackexchange.com
http://chat.stackoverflow.com
... etc
ajakblackgoat