-5

I'm trying to extract the text of a URL using WebClient in C#. However, the content contains HTML tags, and I only want the raw text. My code is as follows:

string webURL = "https://myurl.com";
WebClient wc = new WebClient();
byte[] rawByteArray = wc.DownloadData(webURL);
string webContent = Encoding.UTF8.GetString(rawByteArray);

I get the following error with the above code:

The remote server returned an error: (403) Forbidden.

So I changed my code to:

string webURL = "https://myurl.com";
WebClient wc = new WebClient();
wc.Headers.Add("user-agent", "Only a Header!");
byte[] rawByteArray = wc.DownloadData(webURL);
string webContent = Encoding.UTF8.GetString(rawByteArray);

The above code runs without error, but the result contains HTML tags. The tags can be removed using Regex:

var result = Regex.Replace(webContent, "<.*?>", String.Empty);

But this method is not accurate and does not perform well. Is there a better way to extract just the text, without the HTML tags, from a URL?

Hossein Sabziani

2 Answers

3

The Navigate function doesn't block execution. You need to register for the DocumentCompleted event, then you should be able to grab the contents within that.
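As a rough sketch of that approach, assuming a Windows Forms `WebBrowser` control is being used (the question itself uses `WebClient`, so this only applies if you go the browser-control route), the handler could look something like the following. The URL is the placeholder from the question, and the frame check is an assumption about what you want to wait for:

```csharp
using System;
using System.Windows.Forms;

static class Program
{
    [STAThread] // WebBrowser requires a single-threaded apartment
    static void Main()
    {
        var browser = new WebBrowser { ScriptErrorsSuppressed = true };

        browser.DocumentCompleted += (sender, e) =>
        {
            // The event can fire once per frame; wait for the top-level document.
            if (e.Url != browser.Url) return;

            // InnerText returns the body's text without markup.
            HtmlElement body = browser.Document?.Body;
            Console.WriteLine(body?.InnerText ?? string.Empty);

            Application.ExitThread(); // end the message loop started below
        };

        browser.Navigate("https://myurl.com"); // returns immediately, does not block
        Application.Run();                     // pump messages until ExitThread is called
    }
}
```

Reading `Document` right after `Navigate` returns is what usually goes wrong, since the page has not loaded yet at that point.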

Matthew Haugen
0

That's not the right way to use it. First of all, you should use WebClient.

You can try this code:

    WebClient client = new WebClient();
    string content = client.DownloadString("https://stackoverflow.com/search?q=web+browser+c%23");
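Since the downloaded string still contains markup, here is a minimal sketch of one way to reduce it to plain text using only the base class library. `HtmlText.StripTags` is a hypothetical helper name, not part of .NET. It assumes reasonably well-formed HTML, and it does not skip the contents of `<script>` or `<style>` elements or handle a `>` inside an attribute value:

```csharp
using System;
using System.Net;
using System.Text;
using System.Text.RegularExpressions;

public static class HtmlText
{
    // Hypothetical helper: drops everything between '<' and '>' in one pass.
    public static string StripTags(string html)
    {
        var sb = new StringBuilder(html.Length);
        bool insideTag = false;

        foreach (char c in html)
        {
            if (c == '<')
            {
                insideTag = true;
            }
            else if (c == '>' && insideTag)
            {
                insideTag = false;
                sb.Append(' '); // keep words in adjacent elements apart
            }
            else if (!insideTag)
            {
                sb.Append(c);
            }
        }

        // Decode entities such as &amp; and &lt; into their characters.
        string decoded = WebUtility.HtmlDecode(sb.ToString());

        // Collapse runs of whitespace left behind by the removed tags.
        return Regex.Replace(decoded, @"\s+", " ").Trim();
    }
}
```

You would call it as `string text = HtmlText.StripTags(content);`. For real-world pages a proper HTML parser (for example the Html Agility Pack library) is likely to be more reliable than any hand-rolled stripping.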
Ali Vojdanian