How to get only plain text from HTML using C#?

Question

Hi guys.

I'm trying to create an app that finds the most frequently used words in a string. In my case, the string is HTML. I can already get the HTML from a URI, for example from "https://www.bbc.com/news/world-middle-east-57327591":

var url = "https://www.bbc.com/news/world-middle-east-57327591";
var httpClient = new HttpClient();
var html = await httpClient.GetStringAsync(url);

The html variable contains the same HTML as the page source, which is fine.

But how do I get rid of all the styles, scripts, and other additional information, and get only the plain text in a string variable?

I want my application to work not only for BBC HTML, but for any HTML I can get from the net. My idea is that I should get the text from every element, such as <div>, <p>, <b>, <i>, <a>, because not all of the text is stored in <p> elements.

Does this answer your question? [What is the best way to parse html in C#?](https://stackoverflow.com/questions/56107/what-is-the-best-way-to-parse-html-in-c) — devlin carnate, Jun 02 '21 at 21:25
You can do it on the client with DOMParser.parseFromString https://developer.mozilla.org/en-US/docs/Web/API/DOMParser/parseFromString — Flydog57, Jun 02 '21 at 21:57

score -1 · Answer 1 · answered Jun 02 '21 at 21:24

As per this answer, try the following:


var url = "https://www.bbc.com/news/world-middle-east-57327591";
var httpClient = new HttpClient();
var html = await httpClient.GetStringAsync(url);
// Create a regex pattern that matches every HTML tag
string pattern = @"<(.|\n)*?>";
// Replace every matched tag with nothing.
// Note: this removes only the tags; the contents of <script> and <style> elements remain.
return Regex.Replace(html, pattern, string.Empty);
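Since the question also asks to drop scripts and styles, the tag-stripping approach above could be extended roughly as follows. This is only a minimal sketch using the .NET base class library: the class name HtmlText and the method name ToPlainText are hypothetical, and regex-based stripping is an approximation that can fail on malformed or unusual markup, so a dedicated HTML parser (as discussed in the linked question) is generally more robust.

```csharp
using System;
using System.Net;
using System.Text.RegularExpressions;

// Hypothetical helper for illustration; not part of any library.
public static class HtmlText
{
    public static string ToPlainText(string html)
    {
        // 1. Drop <script> and <style> elements together with their contents.
        var text = Regex.Replace(
            html,
            @"<(script|style)\b[^>]*>.*?</\1\s*>",
            " ",
            RegexOptions.IgnoreCase | RegexOptions.Singleline);

        // 2. Drop HTML comments.
        text = Regex.Replace(text, @"<!--.*?-->", " ", RegexOptions.Singleline);

        // 3. Replace each remaining tag with a space so adjacent words stay separated.
        text = Regex.Replace(text, @"<[^>]+>", " ");

        // 4. Decode entities such as &amp; and &nbsp;.
        text = WebUtility.HtmlDecode(text);

        // 5. Collapse runs of whitespace into single spaces.
        return Regex.Replace(text, @"\s+", " ").Trim();
    }
}
```

With the html variable from the snippet above, this would be called as var text = HtmlText.ToPlainText(html); and the result could then be split into words for counting.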
