Parsing HTML - Getting the paragraph with the most text

Question

I am trying to parse an HTML page. The page isn't known in advance and changes often, but it is always a news site. Basically, I need to pull the news article out of the markup downloaded from the site, which I'm trying to do with a regex like this:

Match m = Regex.Match(x.Result, @"<p>(.+?)</p>");

This is obviously a bad idea: it matches anything tagged as a paragraph, not just the article text.

Any better ways to pull a news article or large body of text, separated from the code, from a website?

Can you look for any classes or ids that may assist in deciphering if the data inside the tag is useful to you? — Seth McClaine, Oct 08 '14 at 04:38
^To add to the point above, use a HTML parsing library to select the tag and ask it to strip all HTML tags. — nhahtdh, Oct 08 '14 at 04:40

score 0 · Answer 1 · edited May 23 '17 at 12:11

Well, this may not be exactly what you want (you haven't provided a lot of detail), but you can strip all tags from a page with a pair of simple regexes.

Remove JavaScript and CSS:

<(script|style).*?</\1>

Remove tags:

<.*?>

Credit goes to this existing answer. What you will be left with is the "plain text" from the page.
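For illustration, here is a minimal C# sketch of how the two patterns above might be applied in sequence. The class and method names (`HtmlTextSketch`, `StripTags`) are hypothetical, and the entity decoding and whitespace cleanup are additions not mentioned in the answer. It assumes `RegexOptions.Singleline` is wanted, since without it `.` does not match newlines and multi-line script or style blocks would be left in place. Note that `<.*?>` can misfire on markup containing `>` inside attribute values or comments, so treat this as a rough first pass rather than a robust parser.

```csharp
using System;
using System.Net;
using System.Text.RegularExpressions;

static class HtmlTextSketch
{
    // Hypothetical helper (not from the original answer): applies the two
    // regexes in order, then tidies up the remaining text.
    public static string StripTags(string html)
    {
        // Singleline lets '.' match newlines, so multi-line <script>/<style>
        // blocks are removed as a whole. IgnoreCase also covers <SCRIPT> etc.
        const RegexOptions Options = RegexOptions.Singleline | RegexOptions.IgnoreCase;

        // Step 1: remove JavaScript and CSS, including their contents.
        string noScripts = Regex.Replace(html, @"<(script|style).*?</\1>", " ", Options);

        // Step 2: remove every remaining tag, keeping the text between tags.
        string noTags = Regex.Replace(noScripts, @"<.*?>", " ", Options);

        // Assumption: decoding entities and collapsing whitespace is desired.
        string decoded = WebUtility.HtmlDecode(noTags);
        return Regex.Replace(decoded, @"\s+", " ").Trim();
    }

    static void Main()
    {
        string html =
            "<html><head><style>p { color: red; }</style></head>" +
            "<body><script>\nvar x = 1;\n</script><p>Hello &amp; welcome</p></body></html>";

        Console.WriteLine(StripTags(html)); // prints "Hello & welcome"
    }
}
```

Replacing each match with a space rather than an empty string keeps the text of adjacent elements from running together. The final whitespace pass then collapses the extra spaces.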

answered Oct 08 '14 at 04:47

Garcia Hurtado
