I am trying to parse an HTML page (the page isn't known in advance and changes often, but it is always a news site). Basically, I need to pull the news article out of the markup downloaded from the site, which I'm trying to do with a regex like this:

Match m = Regex.Match(x.Result, @"<p>(.+?)</p>");

This is obviously a bad idea: it matches anything tagged as a paragraph, not just the article.

Is there a better way to pull a news article, or any large body of text, out of a page's markup?

Kevin

1 Answer

Well, this may not be exactly what you want (you haven't provided a lot of detail), but you can strip all tags from a page with a pair of simple regexes.

Remove JavaScript and CSS:

<(script|style).*?</\1>

Remove the remaining tags:

<.*?>

Credit goes to this existing answer. What you will be left with is the "plain text" from the page.
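For what it's worth, here is a minimal C# sketch of how the two patterns above could be applied in sequence. The `HtmlText` class and `StripTags` method are hypothetical names made up for illustration. The `RegexOptions.Singleline` flag is my assumption: without it `.` does not match newlines, so a `<script>` or `<style>` block spanning several lines would not be removed. The entity decoding and whitespace cleanup at the end are extras, not part of the two regexes.

```csharp
using System;
using System.Net;
using System.Text.RegularExpressions;

// Hypothetical helper for illustration; not part of any library.
public static class HtmlText
{
    public static string StripTags(string html)
    {
        // Singleline lets '.' match newlines; IgnoreCase also covers <SCRIPT> and <Style>.
        const RegexOptions opts = RegexOptions.Singleline | RegexOptions.IgnoreCase;

        // Step 1: drop <script> and <style> elements together with their contents.
        string text = Regex.Replace(html, @"<(script|style).*?</\1>", " ", opts);

        // Step 2: drop every remaining tag, keeping the text between tags.
        text = Regex.Replace(text, @"<.*?>", " ", opts);

        // Extra cleanup: decode entities such as &amp; and collapse runs of whitespace.
        text = WebUtility.HtmlDecode(text);
        return Regex.Replace(text, @"\s+", " ").Trim();
    }
}
```

With the code from the question, this would be called as `HtmlText.StripTags(x.Result)`. Keep in mind it returns all visible text on the page, including menus and footers, so it does not isolate the article by itself.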

Garcia Hurtado