9

I am looking for a reliable way of extracting text from a web page, given its address, in ASP.NET/C#. Can anyone point me in the right direction?

Also, the web address could be, say, a news site that has a lot of ads, menus, etc. I need some intelligent way of extracting only the relevant content. I am not sure how this could be done, since how would I even define what "relevant" is?

Should I maybe read from an RSS feed instead? Any thoughts on this?

EDIT: I have added a bounty. I am looking to extract the "relevant" text from a URL. By "relevant" I mean text excluding ads (and other irrelevant info). The input will be similar to a news site; I need to extract only the news info and get rid of the extraneous text.

Nick

6 Answers

4

Once you have downloaded the page and started using a library like HTML Agility Pack to parse the html, your work starts :)

Screen scraping is divided into two parts.

First, the web crawler (there is lots of information on this on the web, and simple WebClient code is provided in other answers here). The crawler has to traverse links and download pages. If you are downloading a lot of pages and have the start url, you could roll your own or use an existing one. Check out Wikipedia for a list of open source web crawlers/spiders.

The second part is parsing the html and pulling out only the text you want, omitting any noise (headers, banners, footers, etc). Just traversing the DOM is easy with existing libraries; figuring out what to do with what you parse is the hard part.
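
For illustration, here is a minimal sketch of that second part, assuming the HTML Agility Pack and a placeholder URL; the list of noise tags to strip is only a starting point you would tune per site:

using System;
using System.Linq;
using System.Net;
using HtmlAgilityPack;

class PageTextExtractor
{
    static void Main()
    {
        string html;
        using (WebClient client = new WebClient())
        {
            // Placeholder URL; substitute the article you want to scrape.
            html = client.DownloadString("http://example.com/article");
        }

        HtmlDocument doc = new HtmlDocument();
        doc.LoadHtml(html);

        // Strip the obvious noise elements first; extend this XPath with
        // the ad and menu containers you discover per site.
        HtmlNodeCollection noise = doc.DocumentNode.SelectNodes("//script|//style|//head");
        if (noise != null)
        {
            foreach (HtmlNode node in noise.ToList())
                node.Remove();
        }

        // Whatever is left in the body is the candidate text.
        HtmlNode body = doc.DocumentNode.SelectSingleNode("//body");
        if (body != null)
            Console.WriteLine(body.InnerText.Trim());
    }
}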

I've written a bit about this before in another SO question, and it might give you some ideas on how to manually grab the content you want. From my experience there is no 100% reliable way to find the main content of a page, and more often than not you need to manually give it some pointers. The difficult part is that if the html layout of the page changes, your screen scraper will start to fail.

You could apply statistics and compare the html of several pages from the same site in order to deduce where the ads, menus, etc. are, and eliminate those.
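
As a rough sketch of that idea (placeholder URLs, deliberately naive comparison): text blocks that appear on two different pages of the same site are probably chrome, while blocks unique to one page are probably content.

using System;
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack;

class BoilerplateFilter
{
    static HashSet<string> TextBlocks(string url)
    {
        HtmlDocument doc = new HtmlWeb().Load(url);
        HtmlNodeCollection nodes = doc.DocumentNode.SelectNodes("//body//text()");
        HashSet<string> blocks = new HashSet<string>();
        if (nodes != null)
        {
            foreach (HtmlNode node in nodes)
            {
                string text = node.InnerText.Trim();
                if (text.Length > 0)
                    blocks.Add(text);
            }
        }
        return blocks;
    }

    static void Main()
    {
        // Two articles from the same site (placeholder URLs).
        HashSet<string> pageA = TextBlocks("http://example.com/article-1");
        HashSet<string> pageB = TextBlocks("http://example.com/article-2");

        // Text that appears on both pages is treated as layout/ads;
        // what is unique to the first page is its candidate content.
        foreach (string block in pageA.Except(pageB))
            Console.WriteLine(block);
    }
}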

Since you mention news sites, there are two other approaches which should be easier to apply to these sites than parsing the text out of the original html.

  1. Check if the page has a print url. E.g., a link on CNN has an equivalent print url which is much easier to parse.
  2. Check if the page has an RSS representation, and pick the article text from the RSS feed instead. If the feed doesn't have all the content, it should still give you enough text to locate the article in the full html page. A sketch of reading a feed follows this list.
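
For the RSS option, a minimal sketch using the built-in System.ServiceModel.Syndication classes (.NET 3.5+, requires a reference to System.ServiceModel.Web); the feed URL is a placeholder:

using System;
using System.ServiceModel.Syndication;
using System.Xml;

class FeedReader
{
    static void Main()
    {
        // Placeholder feed URL.
        using (XmlReader reader = XmlReader.Create("http://example.com/rss.xml"))
        {
            SyndicationFeed feed = SyndicationFeed.Load(reader);
            foreach (SyndicationItem item in feed.Items)
            {
                Console.WriteLine(item.Title.Text);

                // The summary is often only a teaser; if so, use it to
                // locate the full article text within the html page.
                if (item.Summary != null)
                    Console.WriteLine(item.Summary.Text);
            }
        }
    }
}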

Also check out The Easy Way to Extract Useful Text from Arbitrary HTML for input on how to create a more general parser. The code is in Python, but you should be able to convert it without too much trouble.

Mikael Svenson
3

I think you need an html parser like HtmlAgilityPack, or you can use the newborn baby, YQL. It's a new tool developed by Yahoo; its syntax is like SQL, and you need a little knowledge of XPath.

http://developer.yahoo.com/yql/
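
As a hedged example of how that could look from C# (the query and target URL are just placeholders, and the endpoint format follows Yahoo's YQL documentation):

using System;
using System.Net;

class YqlExample
{
    static void Main()
    {
        // Example YQL query: pull the <p> elements out of a page.
        string yql = "select * from html where url=\"http://example.com\" and xpath=\"//p\"";
        string endpoint = "http://query.yahooapis.com/v1/public/yql?q="
                          + Uri.EscapeDataString(yql) + "&format=xml";

        using (WebClient client = new WebClient())
        {
            // The response is an XML document wrapping the matched nodes.
            Console.WriteLine(client.DownloadString(endpoint));
        }
    }
}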

Thanks!

Shakeeb Ahmed
2

Use a WebClient instance to get your markup...

Dim Markup As String

Using Client As New WebClient()
    Markup = Client.DownloadString("http://www.google.com")
End Using

And then use the HtmlAgilityPack to parse the response with XPath...

Dim Doc As New HtmlDocument()
Doc.LoadHtml(Markup)

If Doc.ParseErrors.Count = 0 Then
    Dim Node As HtmlNode = Doc.DocumentNode.SelectSingleNode("//body")

    If Node IsNot Nothing Then
        'Do something with Node
    End If
End If
Josh Stodola
  • Nice to see some VB here. I will note, however, that there is a C# tag in the question. You'd probably get more up-votes if you provided both. – Armstrongest Apr 12 '10 at 20:35
0

In order to get the actual html markup, try the WebClient object. Something like this will get you the markup:

System.Net.WebClient client = new System.Net.WebClient();

// Add a user agent header in case the
// requested URI contains a query.
client.Headers.Add("user-agent", "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.2; .NET CLR 1.0.3705;)");

System.IO.Stream data = client.OpenRead("http://www.google.com");
System.IO.StreamReader reader = new System.IO.StreamReader(data);
string s = reader.ReadToEnd();
// "s" now contains your entire html page source
reader.Close();
data.Close();

Then like isc-fausto said, you can use regular expressions to parse the output as needed.

Steve Danner
  • Any URL needs to be supported by this app. Since the web pages do not follow the same pattern, I am not sure if it's even possible for the parser to be intelligent in stripping out "irrelevant" data – Nick Feb 13 '10 at 03:31
  • Trying to use regular expressions to parse HTML can be really hairy and frustrating. Use the HTML Agility Pack if you can - it's a DOM parser, which is REALLY what you need to extract text from HTML. – Brandon Montgomery Feb 13 '10 at 05:53
  • Where does the agility pack fit in? I use Steve's code to grab the HTML and run it through the pack to strip out the html tags and irrelevant content and get plain text? Are there built in methods in the agility pack to do this? Thanks – Nick Feb 14 '10 at 13:51
  • I guess I am confused how the agility pack fits in. Once I have the HTML from your code, how do I use the pack to get the "relevant" text content? – Nick Feb 26 '10 at 17:11
  • -1 because you are not `using` and -1 again for even thinking about parsing HTML with regex – Josh Stodola Apr 06 '10 at 13:38
0

Text summarization techniques are what you're probably after. But as a rough heuristic, you can do this with some relatively simple steps as long as you aren't counting on 100% perfect results all of the time.

As long as you don't need to support writing systems that don't put spaces between words (Chinese, Japanese), you can get pretty good results by looking for the first couple of runs of consecutive word sequences, with an arbitrary threshold that you'll spend a few days tuning. (Chinese and Japanese would require a reasonable word-break identification algorithm in addition to this heuristic.)

I would start with an HTML parser (HTML Agility Pack in .NET, or something like Ruby's Nokogiri or Python's BeautifulSoup if you'd like to experiment with the algorithms in a more interactive environment before committing to your C# solution).

To reduce the search space, eliminate sequences of links with little or no surrounding text, using the features of your HTML parser. That should get rid of most navigation panels and certain types of ads. You could extend this further to look for links that have words after them but no punctuation; this would eliminate descriptive links.

If you start to see runs of text followed by "." or "," with, say, 5 or more words (a threshold you can tune later), score each as a potential sentence or sentence fragment. When you find several runs in a row, the odds are pretty good that you've found the most important part of the page. You could score text with <p> tags around it a bit higher. Once you have a fair number of these sequences, the odds are pretty good that you've got "content" rather than layout chrome.

This won't be perfect, and you may need to add a mechanism to tweak the heuristic based on problematic page structures that you regularly scan. But if you build something based on this approach, it should provide pretty reasonable results for 80% or so of your content.
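
A hedged sketch of that heuristic using the HTML Agility Pack: score each text block by word count and punctuation, boost blocks inside <p> tags, and keep the best scorers. The 5-word minimum and the paragraph bonus are exactly the kind of thresholds you would spend a few days tuning; the URL is a placeholder.

using System;
using System.Linq;
using HtmlAgilityPack;

class ContentScorer
{
    // Word count plus a bonus for sitting inside a <p> tag; zero means
    // "not content". The 5-word minimum and the +10 bonus are arbitrary
    // starting points to tune.
    static int Score(string text, bool inParagraph)
    {
        int words = text.Split(' ').Length;
        if (words < 5)
            return 0;
        if (!text.Contains(".") && !text.Contains(","))
            return 0;
        return words + (inParagraph ? 10 : 0);
    }

    static void Main()
    {
        // Placeholder URL.
        HtmlDocument doc = new HtmlWeb().Load("http://example.com/article");

        HtmlNodeCollection nodes = doc.DocumentNode.SelectNodes("//body//text()");
        if (nodes == null)
            return;

        var ranked = nodes
            .Select(n => new { Text = n.InnerText.Trim(), Node = n })
            .Select(b => new { b.Text, Value = Score(b.Text, b.Node.Ancestors("p").Any()) })
            .Where(b => b.Value > 0)
            .OrderByDescending(b => b.Value);

        // The highest-scoring blocks are the likeliest "content".
        foreach (var block in ranked)
            Console.WriteLine(block.Text);
    }
}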

If you find this kind of method inadequate, you may want to look at Bayesian probability or Hidden Markov Models as a way of improving the results.

JasonTrue
-4

Once you have the web page's html code, you could use regular expressions.

seFausto
  • Parsing HTML with regex is impossible. Do not waste your time. http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454 – Josh Stodola Apr 06 '10 at 13:39