I'm trying to extract the text of a URL using WebClient in C#.
However, the content contains HTML tags and I only want the raw text.
My code is as follows:
string webURL = "https://myurl.com";
WebClient wc = new WebClient();
byte[] rawByteArray = wc.DownloadData(webURL);
string webContent = Encoding.UTF8.GetString(rawByteArray);
I get the following error with the above code:
'The remote server returned an error: (403) Forbidden.'
So I changed my code to send a User-Agent header:
string webURL = "https://myurl.com";
WebClient wc = new WebClient();
wc.Headers.Add("user-agent", "Only a Header!");
byte[] rawByteArray = wc.DownloadData(webURL);
string webContent = Encoding.UTF8.GetString(rawByteArray);
The above code runs without error, but the result contains HTML tags. The tags can be removed using Regex:
var result = Regex.Replace(webContent, "<.*?>", String.Empty);
But this method is not accurate and does not perform well. Is there a better way to extract just the text, without the HTML tags, from a URL?
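To show what I mean by "not accurate", here is a minimal self-contained sketch. The sample HTML string and the names RegexStripDemo and StripTags are made up for illustration; only the Regex pattern is the one from my code above.

```csharp
using System;
using System.Text.RegularExpressions;

public static class RegexStripDemo
{
    // Same pattern as in the question: removes anything that looks like a tag.
    public static string StripTags(string html)
    {
        return Regex.Replace(html, "<.*?>", String.Empty);
    }

    public static void Main()
    {
        // Hypothetical sample page, not the real content of my URL.
        string html =
            "<html><head><style>p { color: red; }</style>" +
            "<script>var x = 1;</script></head>" +
            "<body><p>Tom &amp; Jerry</p></body></html>";

        // The tags are gone, but the CSS and JavaScript bodies are kept
        // and the &amp; entity is not decoded.
        Console.WriteLine(StripTags(html));
        // prints: p { color: red; }var x = 1;Tom &amp; Jerry
    }
}
```

So the style and script contents leak into the result and entities such as &amp; stay encoded. Also, since "." does not match a newline unless RegexOptions.Singleline is used, a tag that spans several lines would not be removed at all.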