Could anyone show me how to extract information from HTML in C#? I'm working on a WinRT class library in C#.

I want to extract the main content and image out of http://lifehacker.com/5923026/remains-of-the-day-google-image-search-gets-knowledge-graph-integration.

Here is part of the page source:

<html xmlns="http://www.w3.org/1999/xhtml" class="feature_chompcommentimages feature_s3upload feature_switch feature_powwowtest" xmlns:fb="http://www.facebook.com/2008/fbml">
    <head>
  **<title>Remains of the Day: Google Image Search Gets Knowledge Graph Integration</title>**
          <meta http-equiv="content-type" content="text/html; charset=utf-8" />
  <meta http-equiv="content-language" content="en" />
  <meta http-equiv="refresh" content="86400" />
  <meta name="robots" content="all" />
                      <meta name="keywords" content="For What It&#039;s Worth, remainders, in brief, Lifehacker" />
                  <meta property="fb:page_id" content="7568536355" />
                              <meta name="title" content="Remains of the Day: Google Image Search Gets Knowledge Graph Integration" />
      **<meta name="description" content="Google updates Image Search with Knowledge Graph integration, VLC for OS X now supports Retina display, Sparrow updates with Retina display and Mountain Lion support, and Amazon introduces barcode scanning app Flow for iOS. " />**
                      <link rel="image_src" href="http://img.gawkerassets.com/img/17rm77tdcfd31jpg/original.jpg" />
          <meta property="og:image" content="http://img.gawkerassets.com/img/17rm77tdcfd31jpg/xlarge.jpg" />
                  <meta property="og:site_name" content="Lifehacker"/>
      <meta property="og:title" content="Remains of the Day: Google Image Search Gets Knowledge Graph Integration" />
      <meta property="og:description" content="Google updates Image Search with Knowledge Graph integration, VLC for OS X now supports Retina display, Sparrow updates with Retina display and Mountain Lion support, and Amazon introduces barcode scanning app Flow for iOS." />
      <meta property="og:type" content="article" />

I can already get the title, "Remains of the Day: Google Image Search Gets Knowledge Graph Integration", with SyndicationFeed.Title.Text (using Windows.Web.Syndication;).

Please help me extract the description from this tag:

<meta name="description" content="Google updates Image Search with Knowledge Graph integration, VLC for OS X now supports Retina display, Sparrow updates with Retina display and Mountain Lion support, and Amazon introduces barcode scanning app Flow for iOS. " />

I also need to extract the main content inside

<div id="container"> <script type="text/javascript">

<!-- %JUMP:More &raquo;% --><\/p>\n<ul>\n<li><a href=\"http:\/\/insidesearch.blogspot.com\/2012\/07\/find-smarter-more-comprehensive-search.html\">Find Smarter, More Comprehensive Search by Image Results<\/a> <i>Google updated its Image Search with a couple of new features. One being an expanded view that lets searchers see the text around matching images, and the other being added support for Knowledge Graph to image search results, which means Google will attempt to identity any photo that you upload or link to and provide more information about the subject.<\/i> [Google Blog]<\/li>\n<li>

Content: "Find Smarter, More Comprehensive Search by Image Results" "Google updated its Image Search with a couple of new features. One being an expanded view that lets searchers see the text around matching images, and the other being added support for Knowledge Graph to image search results, which means Google will attempt to identity any photo that you upload or link to and provide more information about the subject. [Google Blog]"
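The markup above sits inside a JavaScript string, so it has to be unescaped before it can be treated as HTML. Here is a rough sketch of that step, using only the base class library. `EscapedHtml`, `UnescapeJs` and `ToPlainText` are hypothetical names made up for illustration, and the regex-based tag stripping is a crude stand-in for a real HTML parser. It also assumes the only escapes present are the three visible in the snippet (`\/`, `\"`, `\n`).

```csharp
using System;
using System.Net;
using System.Text.RegularExpressions;

public static class EscapedHtml
{
    // Undo the JavaScript string escapes visible in the snippet: \/  \"  \n
    // (Assumption: no other escapes, such as \\ or \uXXXX, occur in the input.)
    public static string UnescapeJs(string s)
    {
        return s.Replace("\\/", "/")
                .Replace("\\\"", "\"")
                .Replace("\\n", "\n");
    }

    // Crude conversion to display text: drop comments, drop tags, decode entities.
    public static string ToPlainText(string html)
    {
        string noComments = Regex.Replace(html, "<!--.*?-->", "", RegexOptions.Singleline);
        string noTags = Regex.Replace(noComments, "<[^>]+>", "");
        return WebUtility.HtmlDecode(noTags).Trim();
    }

    public static void Main()
    {
        // A shortened piece of the escaped markup from the page (verbatim string,
        // so each backslash is literal and "" stands for one double quote).
        const string escaped =
            @"<li><a href=\""http:\/\/insidesearch.blogspot.com\/2012\/07\/find-smarter-more-comprehensive-search.html\"">Find Smarter, More Comprehensive Search by Image Results<\/a> [Google Blog]<\/li>";

        string html = UnescapeJs(escaped);
        Console.WriteLine(html);              // real HTML, ready for a parser
        Console.WriteLine(ToPlainText(html)); // text only
    }
}
```

The output of `UnescapeJs` could also be fed to an HTML parser instead of `ToPlainText` if the link targets are needed as well as the text.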

Thanks a lot!!

[7/4/12]
Sorry, everyone. To clarify: I'm trying to extract the text (as a string) and the image (as a link or a BitmapImage) from the HTML, either by parsing the HTML directly or by converting it to XML first.

I'm using HtmlAgilityPack from htmlagilitypack.codeplex.com, following the tutorial at 4guysfromrolla.com/articles/011211-1.aspx. I'm still wondering whether there is a better solution for Metro style apps, though, since HtmlAgilityPack lacks some support for them. For instance, it has a method to convert HTML to XML, but WinRT no longer supports XmlTextReader from .NET.
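
For anyone reading later, the HtmlAgilityPack approach can be sketched roughly as follows. This is a minimal sketch under stated assumptions, not a definitive implementation. It assumes the HtmlAgilityPack package is referenced, and `PageMetadata` and `GetMetaContent` are hypothetical names made up for illustration. It uses `Descendants` with LINQ rather than XPath, on the assumption that the XPath helpers may not be available in every build of the library.

```csharp
using System;
using System.Linq;
using HtmlAgilityPack;

public static class PageMetadata
{
    // Returns the content attribute of the first <meta> element whose
    // attribute `attrName` equals `attrValue`, or null when none matches.
    public static string GetMetaContent(HtmlDocument doc, string attrName, string attrValue)
    {
        HtmlNode meta = doc.DocumentNode
            .Descendants("meta")
            .FirstOrDefault(n => string.Equals(
                n.GetAttributeValue(attrName, ""),
                attrValue,
                StringComparison.OrdinalIgnoreCase));

        if (meta == null)
        {
            return null;
        }

        // Decode entities such as &#039; and trim stray whitespace
        // (the description in the page source ends with a space).
        return HtmlEntity.DeEntitize(meta.GetAttributeValue("content", "")).Trim();
    }

    public static void Main()
    {
        // Two of the tags from the page source quoted in the question.
        const string html = @"<html><head>
<meta name=""description"" content=""Google updates Image Search with Knowledge Graph integration, VLC for OS X now supports Retina display, Sparrow updates with Retina display and Mountain Lion support, and Amazon introduces barcode scanning app Flow for iOS. "" />
<meta property=""og:image"" content=""http://img.gawkerassets.com/img/17rm77tdcfd31jpg/xlarge.jpg"" />
</head></html>";

        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        Console.WriteLine(GetMetaContent(doc, "name", "description")); // text as string
        Console.WriteLine(GetMetaContent(doc, "property", "og:image")); // image link
    }
}
```

In a real app the HTML string would come from downloading the page first, and the image URL could then be used to construct a BitmapImage.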

Thanks again

Jerry
  • Where do you want to extract the info? Extract as a stream or Extract as a file? –  Jul 03 '12 at 03:09
  • Jerry, if the answer I gave is not the one you're looking for, it would be more polite to get in touch (for example, by commenting on my answer) than to simply vote it down. You're asking for help, and I'm trying to help you. – Andre Calil Jul 03 '12 at 03:12
  • You're going to run into trouble treating that file as XML, because it's not valid XML. Try using an HTML parsing library. See: http://stackoverflow.com/questions/56107/what-is-the-best-way-to-parse-html-in-c – Bennor McCarthy Jul 03 '12 at 03:17
  • duplicate of http://stackoverflow.com/questions/11304400/how-to-parse-html-or-convert-html-to-xml-so-i-extract-the-information-out-of-the – Phil Jul 03 '12 at 04:24
  • possible duplicate of [How to read XML in C#?](http://stackoverflow.com/questions/10476782/how-to-read-xml-in-c) – powtac Jul 03 '12 at 10:51
  • Sorry, everyone, I found the answer. I'm trying to extract the text (content) and the image (link) from the HTML, either by parsing the HTML directly or by converting it to XML first. I'm using HtmlAgilityPack from http://htmlagilitypack.codeplex.com/, following the tutorial at http://www.4guysfromrolla.com/articles/011211-1.aspx. I'm still wondering whether there is a better solution for Metro style apps, though, since HtmlAgilityPack lacks some support for them. For instance, it has a method to convert HTML to XML, but WinRT no longer supports XmlTextReader from .NET. – Jerry Jul 04 '12 at 01:35
  • @Jerry: I suspect there may be others who share this need in the future - if you have the time to write up a short answer explaining how you managed this on WinRT, they could benefit from your efforts. – Shog9 Jul 04 '12 at 02:46

1 Answer

Jerry, rather than parsing this XML yourself, I'd recommend using an RSS library. Take a look at RssToolkit.

Andre Calil