I have saved the source code of a web page (using the save option available in every browser). Now I want to extract everything between double quotes that starts with http://. How can I do that?

Ry-

Junaid Rehman
- Use the HTML Agility Pack – SLaks Apr 28 '13 at 15:14
- Are you trying to extract some particular attribute values? An HTML parser could be more useful. – Ry- Apr 28 '13 at 15:14
- Actually, the source code of a page may contain links, images, etc. Internal links are usually written without "http", so I'm not interested in them. I want everything that starts with http, whether it is an external link, an image, or something else. – Junaid Rehman Apr 28 '13 at 15:57
2 Answers
Using HTML Agility Pack
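using System.Linq;
using HtmlAgilityPack;

string path = ...; // path to the saved HTML file
var doc = new HtmlDocument();
doc.Load(path);
// Collect every attribute value, on any node, that starts with "http://"
var links =
    from e in doc.DocumentNode.Descendants()
    from a in e.Attributes
    where a.Value.StartsWith("http://")
    select a.Value;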
(Note that this only returns URLs that appear in HTML attribute values, not URLs that appear in plain text.)
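If only particular attributes matter, the same query can be narrowed by attribute name. The following is a minimal sketch rather than a definitive implementation: it assumes the HtmlAgilityPack NuGet package is referenced, and the class name AttributeLinkExtractor and the choice of href and src as the link-bearing attributes are illustrative assumptions, not part of the original answer. It parses from a string with LoadHtml instead of loading a file.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack;

public static class AttributeLinkExtractor
{
    // Attribute names treated as link-bearing.
    // This list is an assumption; extend it as needed.
    private static readonly HashSet<string> LinkAttributes =
        new HashSet<string>(StringComparer.OrdinalIgnoreCase) { "href", "src" };

    public static List<string> Extract(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html); // parse from a string instead of a file path

        return doc.DocumentNode.Descendants()
            .SelectMany(e => e.Attributes)
            .Where(a => LinkAttributes.Contains(a.Name)
                     && a.Value.StartsWith("http://", StringComparison.OrdinalIgnoreCase))
            .Select(a => a.Value)
            .ToList();
    }
}
```

Filtering by name skips attributes such as title or alt whose text merely happens to begin with http://.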

Thomas Levesque
Use regex:
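Imports System.Text.RegularExpressions

' html holds the saved page source as a String
Dim mc As MatchCollection = Regex.Matches(html, """(http://.+?)""", RegexOptions.IgnoreCase)
For Each m As Match In mc
    ' Group 1 is the URL without the surrounding quotes
    Console.WriteLine(m.Groups(1).Value)
Next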
Sample output when html is the source code of this page:
http://cdn.sstatic.net/stackoverflow/img/favicon.ico
http://cdn.sstatic.net/stackoverflow/img/apple-touch-icon.png
http://cdn.sstatic.net/js/stub.js?v=181da36f6419
http://cdn.sstatic.net/stackoverflow/all.css?v=0f0c93534e2b
http://stackoverflow.com/questions/16264292/extract-all-values-between-double-quotes-from-a-webpages-source-code
http://www.gravatar.com/avatar/91d33760d2823fa7cf5c95b41a16fada?s=32&d=identicon&r=PG\
http://stackoverflow.com/users/2264365/ajakblackgoat
http://stackexchange.com
http://chat.stackoverflow.com
... etc
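The pattern in the snippet above can be read piece by piece: an opening double quote, a capture group that starts with the literal http:// and continues up to the next double quote, then the closing quote. Below is a minimal C# sketch of the same idea with the pattern annotated. The class name QuotedUrlExtractor and the sample HTML string are illustrative assumptions. Note that it swaps in [^"]+ for the lazy .+? used above; the two behave alike on single-line values, but [^"]+ also matches line breaks inside the quotes.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class QuotedUrlExtractor
{
    // "          opening double quote
    // (          start of capture group 1
    //   http://  literal prefix
    //   [^"]+    one or more characters that are not a double quote
    // )          end of capture group 1
    // "          closing double quote
    private static readonly Regex QuotedHttpUrl =
        new Regex("\"(http://[^\"]+)\"", RegexOptions.IgnoreCase);

    public static List<string> Extract(string html)
    {
        var urls = new List<string>();
        foreach (Match m in QuotedHttpUrl.Matches(html))
        {
            // Groups[0] is the whole match including quotes; Groups[1] is the URL only
            urls.Add(m.Groups[1].Value);
        }
        return urls;
    }

    public static void Main()
    {
        // Hypothetical input: one absolute link and one relative image path
        string html = "<a href=\"http://example.com/a\">x</a> <img src=\"/local.png\">";
        foreach (string url in Extract(html))
        {
            Console.WriteLine(url); // prints http://example.com/a
        }
    }
}
```

To build similar patterns, start from the literal text that must be present (here the quotes and the prefix), then put parentheses around the part you want to get back through Groups[1].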

ajakblackgoat
- Thanks, brother, it worked perfectly :D Can you please tell me how to write patterns like this? – Junaid Rehman Apr 28 '13 at 16:52