How can I extract text from HTML without using third-party libraries?

Question

_request = (HttpWebRequest)WebRequest.Create(url);
_response = (HttpWebResponse) _request.GetResponse();
StreamReader streamReader = new StreamReader(_response.GetResponseStream());
string text = streamReader.ReadToEnd();

The resulting text contains HTML tags. How can I get the text without the HTML tags?

Why can't you use a 3rd party library? You can do this yourself using string parsing APIs, but at what cost? To have a robust parsing algorithm that works in the face of the malformed HTML present everywhere on the web, you're inventing your own "3rd party library". So why not let others do the work for you? Html Agility Pack is the way to go. — Judah Gabriel Himango, Nov 29 '11 at 22:07

score 3 · Answer 1 · answered Mar 08 '18 at 10:50


Try this:

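// Note: XmlDocument.Load parses XML, so this only works when the page is well-formed XHTML;
// ordinary HTML will usually throw an XmlException.
System.Xml.XmlDocument docXML = new System.Xml.XmlDocument();
docXML.Load(url);
string textWithoutTags = docXML.InnerText;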

Be happy :)


Andreas Mathiel


score 3 · Answer 2 · answered Nov 29 '11 at 22:29

How do you extract text from dynamic HTML without using 3rd party libraries? Simple, you invent your own HTML parsing library using the string parsing functions present in the .NET framework.

Seriously, doing this by yourself is a bad idea. If you're pulling dynamic HTML off the web, you have to be prepared for different closing tags, mismatched tags, missing end tags, and so forth. Unless you have a really good reason why you need to write one yourself, just use HTML Agility Pack, and let that do the hard work for you.

Also, make sure you're not succumbing to Not Invented Here Syndrome.

score 2 · Answer 3 · edited May 23 '17 at 11:44


1) Do not use Regular Expressions. (see this great StackOverflow post: RegEx match open tags except XHTML self-contained tags)

2) Use HtmlAgilityPack. But I see you do not want 3rd Party libraries, so we are forced to....

3) Use XmlReader. You can pretty much use the example code straight from MSDN, and just ignore all cases of XmlNodeType except for XmlNodeType.Text. For that case simply write your output to a StreamWriter.
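A minimal sketch of option 3, assuming the input is well-formed XHTML. The class and method names (`XhtmlText`, `Extract`) are made up for illustration, and the text is collected in a `StringBuilder` rather than written to a `StreamWriter` to keep the example short:

```csharp
using System;
using System.IO;
using System.Text;
using System.Xml;

// Hypothetical helper, not part of the .NET framework.
public static class XhtmlText
{
    // Returns the concatenated text nodes of a well-formed XHTML string.
    // Ordinary, non-well-formed HTML makes XmlReader throw an XmlException.
    public static string Extract(string xhtml)
    {
        var settings = new XmlReaderSettings
        {
            // Skip a DOCTYPE instead of rejecting it. Note that named
            // entities such as &nbsp; are then undeclared and will throw.
            DtdProcessing = DtdProcessing.Ignore
        };

        var output = new StringBuilder();
        using (XmlReader reader = XmlReader.Create(new StringReader(xhtml), settings))
        {
            while (reader.Read())
            {
                // Ignore every node type except text, as described above.
                if (reader.NodeType == XmlNodeType.Text)
                {
                    output.Append(reader.Value);
                }
            }
        }
        return output.ToString();
    }
}
```

For example, `XhtmlText.Extract("<html><body><p>Hello <b>world</b></p></body></html>")` returns `Hello world`, while an input with an unclosed tag such as `<p>one<br>two</p>` throws an `XmlException`.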


answered Nov 29 '11 at 22:02

mattypiper



XmlReader will not work for most HTML code... I agree with your points 1 and 2 – Adriano Carneiro Nov 29 '11 at 22:14

If it is well formed XHTML it will work. It is a marginal solution but considering the author's requirement of no third party libraries it is all we have, and is better than a hackish, brittle regex like `Regex.Replace(htmlText, "<.*?>", string.Empty);`. There are some other posts on stackoverflow of people importing controls from all over the place in .NET because those controls give you an InnerHtml property, but I find that to be equally hackish. – mattypiper Nov 29 '11 at 22:21

@ttmatty Agreed. But OP is reading from the internet, not his own pages. He cannot guarantee XHTML. But I can guarantee you that a lot of code (if not most code) will not be XHTML – Adriano Carneiro Nov 29 '11 at 23:00
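To illustrate why the `Regex.Replace(htmlText, "<.*?>", string.Empty)` approach quoted in the comments is considered brittle, here is a small sketch. The wrapper class `NaiveTagStripper` is a made-up name; the inputs are illustrative only:

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical wrapper around the regex quoted in the comments.
public static class NaiveTagStripper
{
    public static string Strip(string html)
    {
        // Lazily matches from each '<' to the next '>' and deletes the match.
        return Regex.Replace(html, "<.*?>", string.Empty);
    }
}

public static class NaiveTagStripperDemo
{
    public static void Main()
    {
        // Simple markup works:
        Console.WriteLine(NaiveTagStripper.Strip("<p>Hello <b>world</b></p>"));
        // prints: Hello world

        // A '>' inside an attribute value ends the match too early:
        Console.WriteLine(NaiveTagStripper.Strip("<a title=\"a > b\">link</a>"));
        // prints:  b">link

        // Script content is not markup, so it is left in the output:
        Console.WriteLine(NaiveTagStripper.Strip("<script>var x = 1;</script>Hi"));
        // prints: var x = 1;Hi
    }
}
```

Because `.` does not match a newline by default, a tag that spans several lines is also left untouched unless `RegexOptions.Singleline` is passed.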

score 1 · Answer 4 · edited May 23 '17 at 12:32


This question has been asked before. There are a few ways to do it, including using a Regular Expression or as pointed out by Adrian, the Agility Pack.

See this question: How can I strip HTML tags from a string in ASP.NET?


answered Nov 29 '11 at 20:59

David Schwartz


That post uses 3rd party libraries... I need a solution without them. – isxaker Nov 29 '11 at 21:04
The regex answer doesn't fully answer the question. – isxaker Nov 29 '11 at 21:17
