Get non-HTML content from a page

Question

Is there any way to get the non-HTML content from a page? By non-HTML I mean the words and sentences in a page, without the HTML tags.

I can get the page source by using

Dim sourceString As String = New System.Net.WebClient().DownloadString("SomeWebPage.com")

But how can I get only the non-HTML content of a web page in a similar way?

First, get the value of sourceString into a JavaScript variable, then use jQuery with a regex (one that matches HTML tags <>; there are plenty out there, Google it) to iterate over the HTML page and get all the non-HTML content. — talhatahir, Oct 31 '14 at 05:52
Good grief! RegEx? Try HtmlAgilityPack if you want to parse HTML in the .NET world. — Tim, Oct 31 '14 at 06:41
http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454 — 5uperdan, Oct 31 '14 at 08:36
This may be better phrased as wanting to extract plain text from an HTML page. To do this, use an HTML parser. HtmlAgilityPack is one library that is often used. — Jon P, Nov 04 '14 at 03:34
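
For anyone who wants to try the parser route suggested in the comments, here is a minimal sketch. It assumes the HtmlAgilityPack NuGet package is referenced; the URL and the module name are placeholders of mine, not something from the question.

```vb
Imports System
Imports System.Net
Imports HtmlAgilityPack

Module PlainTextDemo
    Sub Main()
        ' Placeholder URL (assumption): replace with the page you want to read.
        Dim sourceString As String = New WebClient().DownloadString("http://www.example.com")

        ' Parse the downloaded markup into a DOM-like tree.
        Dim doc As New HtmlDocument()
        doc.LoadHtml(sourceString)

        ' Drop script and style elements so their contents do not end up in the text.
        ' SelectNodes returns Nothing when no node matches, hence the check.
        Dim unwanted = doc.DocumentNode.SelectNodes("//script|//style")
        If unwanted IsNot Nothing Then
            For Each node As HtmlNode In unwanted
                node.Remove()
            Next
        End If

        ' InnerText concatenates the remaining text nodes;
        ' DeEntitize turns entities such as &amp; back into characters.
        Dim plainText As String = HtmlEntity.DeEntitize(doc.DocumentNode.InnerText)
        Console.WriteLine(plainText)
    End Sub
End Module
```

The result will probably still contain a lot of whitespace left over from the page layout, so you may want to collapse it afterwards.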

score 0 · Accepted Answer · answered Nov 04 '14 at 02:40

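This should work if the HTML is properly structured. It strips everything between < and >, but a tag that spans several lines will be missed, because . does not match newlines by default.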

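' Download the page source (URLs use forward slashes, not backslashes)
Dim myhtml As String = New System.Net.WebClient().DownloadString("http://www.test.com")
' Remove anything that looks like a tag, leaving only the text between tags
Dim plaintext As String = System.Text.RegularExpressions.Regex.Replace(myhtml, "<.*?>", "")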
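
If you stay with the regex approach, a slightly more defensive version might look like the sketch below. StripTags is a hypothetical helper name of mine, and the extra steps (removing script/style blocks and comments, decoding entities, collapsing whitespace) are my assumptions about what "non-HTML content" should mean. It is still not a real parser and can be fooled by unusual markup.

```vb
Imports System
Imports System.Net
Imports System.Text.RegularExpressions

Module StripTagsDemo
    ' Hypothetical helper: returns the visible text of an HTML string (best effort).
    Function StripTags(html As String) As String
        ' Singleline lets . match newlines, so multi-line tags and blocks are covered.
        Dim opts = RegexOptions.Singleline Or RegexOptions.IgnoreCase

        ' Remove script and style blocks together with their contents.
        Dim text = Regex.Replace(html, "<(script|style)\b.*?</\1\s*>", " ", opts)
        ' Remove HTML comments.
        text = Regex.Replace(text, "<!--.*?-->", " ", opts)
        ' Remove the remaining tags.
        text = Regex.Replace(text, "<[^>]+>", " ", opts)
        ' Decode entities such as &amp; and &lt;.
        text = WebUtility.HtmlDecode(text)
        ' Collapse runs of whitespace into single spaces.
        Return Regex.Replace(text, "\s+", " ").Trim()
    End Function

    Sub Main()
        Dim sample = "<p>Hello <b>world</b> &amp; all</p><script>var x = 1;</script>"
        Console.WriteLine(StripTags(sample)) ' prints: Hello world & all
    End Sub
End Module
```

To use it on a live page, pass the string returned by DownloadString to StripTags instead of the sample.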

Rob