Is there any way to get the non-HTML content from a page? By non-HTML I mean the words and sentences in the page, without the HTML tags.

I can get the page source using:

Dim sourceString As String = New System.Net.WebClient().DownloadString("SomeWebPage.com")

But how can I get only the non-HTML content from the webpage?

  • First, get the value of sourceString into a JavaScript variable, then use jQuery with a regex (one that matches HTML tags <>; there are plenty out there, Google it) to iterate over the HTML page and collect all the non-HTML content – talhatahir Oct 31 '14 at 05:52
  • Good grief! RegEx? Try HtmlAgilityPack if you want to parse HTML in the .NET world. – Tim Oct 31 '14 at 06:41
  • http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454 – 5uperdan Oct 31 '14 at 08:36
  • This may be better phrased as wanting to extract plain text from an HTML page. To do this, use an HTML parser. HtmlAgilityPack is one library that is often used. – Jon P Nov 04 '14 at 03:34

1 Answer

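This should work if the HTML is well formed. Note that it only removes the tags themselves, so the contents of script and style elements stay in the result: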

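Dim myhtml As String = New System.Net.WebClient().DownloadString("http://www.test.com")
' RegexOptions.Singleline lets "." match newlines, so tags that span several lines are removed too.
Dim plaintext As String = System.Text.RegularExpressions.Regex.Replace(myhtml, "<.*?>", "", System.Text.RegularExpressions.RegexOptions.Singleline)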
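
A slightly fuller sketch of the same regex idea, in case it helps: it drops script and style blocks along with their contents, removes comments and the remaining tags, decodes entities such as &amp;, and collapses whitespace. The module name HtmlTextSketch and the helper StripHtml are made up for illustration, and this is still a regex approximation rather than a real parser, so malformed markup (or a ">" inside an attribute value) can trip it up. An HTML parser such as HtmlAgilityPack, as suggested in the comments, is likely the more robust route.

```vbnet
Imports System
Imports System.Net
Imports System.Text.RegularExpressions

Module HtmlTextSketch

    ' Hypothetical helper (not part of the original answer):
    ' reduces an HTML string to rough plain text using regular expressions.
    Function StripHtml(html As String) As String
        Dim opts As RegexOptions = RegexOptions.IgnoreCase Or RegexOptions.Singleline

        ' Remove script and style elements together with their contents.
        Dim text As String = Regex.Replace(html, "<(script|style)\b[^>]*>.*?</\1\s*>", " ", opts)

        ' Remove HTML comments; Singleline lets "." cross line breaks.
        text = Regex.Replace(text, "<!--.*?-->", " ", opts)

        ' Remove the remaining tags. [^>] also matches newlines,
        ' so a tag that spans several lines is removed as well.
        text = Regex.Replace(text, "<[^>]*>", " ")

        ' Turn entities such as &amp; back into the characters they stand for.
        text = WebUtility.HtmlDecode(text)

        ' Collapse runs of whitespace into single spaces.
        Return Regex.Replace(text, "\s+", " ").Trim()
    End Function

    Sub Main()
        ' Inline sample so the sketch runs without network access.
        Dim sample As String = "<html><head><style>p { color: red; }</style></head>" &
            "<body><p>Hello &amp; <b>welcome</b></p><script>var x = 1;</script></body></html>"

        Console.WriteLine(StripHtml(sample)) ' prints: Hello & welcome
    End Sub

End Module
```

To use it on a live page, pass the string returned by DownloadString to StripHtml instead of the inline sample.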
Rob