Read text from a webpage without HTML in C#

Question

I'm trying to extract the text of a URL using WebClient in C#, but the content contains HTML tags and I only want the raw text. My code is as follows:

string webURL = "https://myurl.com";
WebClient wc = new WebClient();
byte[] rawByteArray = wc.DownloadData(webURL);
string webContent = Encoding.UTF8.GetString(rawByteArray);

I get the following error with the above code:

The remote server returned an error: (403) Forbidden.

So I changed my code to:

string webURL = "https://myurl.com";
WebClient wc = new WebClient();
wc.Headers.Add("user-agent", "Only a Header!");
byte[] rawByteArray = wc.DownloadData(webURL);
string webContent = Encoding.UTF8.GetString(rawByteArray);

This code runs without error, but the result still contains HTML tags. The tags can be removed with a Regex:

var result = Regex.Replace(webContent, "<.*?>", String.Empty);

But this method is not accurate and does not perform well. Is there a better way to extract just the text, without the HTML tags, from a URL?

score 3 · Answer 1 · answered Mar 01 '15 at 05:57

The WebBrowser control's Navigate method doesn't block execution: it returns before the page has finished loading. Register a handler for the DocumentCompleted event, and you should be able to grab the contents inside that handler.
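A minimal sketch of that event-based approach, assuming a Windows Forms project (the `System.Windows.Forms.WebBrowser` control) and reusing the placeholder URL from the question. The bare message loop and the choice to print the result to the console are illustrative assumptions, not something the answer specifies.

```csharp
using System;
using System.Windows.Forms;

static class Program
{
    [STAThread] // the WebBrowser control requires a single-threaded apartment
    static void Main()
    {
        var browser = new WebBrowser { ScriptErrorsSuppressed = true };

        browser.DocumentCompleted += (sender, e) =>
        {
            // The event can fire once per frame; wait until the whole page is ready.
            if (browser.ReadyState != WebBrowserReadyState.Complete) return;

            // InnerText is the rendered text of the body, without markup.
            string text = browser.Document?.Body?.InnerText ?? string.Empty;
            Console.WriteLine(text);

            Application.ExitThread(); // stop the message loop started below
        };

        browser.Navigate("https://myurl.com"); // returns immediately
        Application.Run();                     // pump messages until the handler exits
    }
}
```

Because Navigate returns immediately, reading `browser.Document` right after the call would typically see an empty or partially loaded document, which is why the text is read inside the handler.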

Matthew Haugen

score 0 · Answer 2 · answered Mar 01 '15 at 06:53

That's not the right way to use it. First of all, you need to use WebClient.

Now you can try this code:

    // DownloadString returns the page source as a string (HTML tags included).
    WebClient client = new WebClient();
    string content = client.DownloadString("https://stackoverflow.com/search?q=web+browser+c%23");
