
I am trying to read in the current day's Dilbert image. I am able to get the full HTML of the page by doing this:

    var todayDate = DateTime.Now.ToString("yyyy-MM-dd");
    var web = new HtmlWeb();
    web.UseCookies = true;
    var wp = new WebProxy("http://myproxy:8080");
    wp.UseDefaultCredentials = true;
    NetworkCredential nc = (NetworkCredential)CredentialCache.DefaultCredentials;
    HtmlDocument document = web.Load("http://www.dilbert.com/strips/comic/" + todayDate, "GET", wp, nc);

If I look at the full HTML of the document, I see the image URL listed multiple times on the page, such as:

<meta property="og:image" content="http://assets.amuniversal.com/c2168fa0c45a0132d8f0005056a9545d"/>

or:

<meta name="twitter:image" content="http://assets.amuniversal.com/c2168fa0c45a0132d8f0005056a9545d">

or

  <img alt="Squirrel In The Large Hadron Collider - Dilbert by Scott Adams" class="img-responsive img-comic" height="280" src="http://assets.amuniversal.com/c2168fa0c45a0132d8f0005056a9545d" width="900" />

What is the best way to parse the image URL out of this page?

Ben
leora
  • possible duplicate of [What is the best way to parse html in C#?](http://stackoverflow.com/questions/56107/what-is-the-best-way-to-parse-html-in-c) – Jesse Good May 04 '15 at 23:34

2 Answers


You can try using HtmlAgilityPack or a similar library to parse the structure of the response HTML and then walk the DOM generated by the parser.
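
A minimal sketch of that approach, assuming the HtmlAgilityPack NuGet package is referenced and that the page exposes the image through the `og:image` meta tag quoted in the question. `ComicImageExtractor` and `ExtractImageUrl` are hypothetical names chosen for illustration, and the inline HTML stands in for the document returned by `web.Load(...)`.

```csharp
using System;
using HtmlAgilityPack; // NuGet package, the same library the question already uses

public static class ComicImageExtractor
{
    // Hypothetical helper: returns the og:image URL, or null if the tag
    // or its content attribute is missing.
    public static string ExtractImageUrl(HtmlDocument document)
    {
        // XPath: any <meta> element whose property attribute equals "og:image".
        // SelectSingleNode returns null when nothing matches.
        HtmlNode meta = document.DocumentNode.SelectSingleNode("//meta[@property='og:image']");

        // The attribute indexer returns null when the attribute is absent.
        return meta?.Attributes["content"]?.Value;
    }

    public static void Main()
    {
        // Stand-in for the document loaded from the site in the question.
        var document = new HtmlDocument();
        document.LoadHtml(
            "<html><head>" +
            "<meta property=\"og:image\" content=\"http://assets.amuniversal.com/c2168fa0c45a0132d8f0005056a9545d\"/>" +
            "</head><body></body></html>");

        Console.WriteLine(ExtractImageUrl(document) ?? "og:image tag not found");
    }
}
```

The same query style should work for the other two occurrences, for example `//img[contains(@class,'img-comic')]` with the `src` attribute, if the meta tag ever disappears from the page.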

xxbbcc

You can use HtmlAgilityPack if you are going to do lots of DOM manipulation, but a quick and dirty hack is to just use the built-in .NET string methods.

This is untested and written without an IDE but you could try something like:

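// documentHtml is assumed to hold the page's HTML as a string.
var urlStartText = "<meta property=\"og:image\" content=\"";
var urlEndText = "\"/>";
// Note: IndexOf returns -1 when the text is not found; that case is not handled here.
var urlStartIndex = documentHtml.IndexOf(urlStartText) + urlStartText.Length;
var url = documentHtml.Substring(urlStartIndex, documentHtml.IndexOf(urlEndText, urlStartIndex) - urlStartIndex);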

The idea is to find the start and end positions of the HTML text surrounding the URL and then use Substring to grab what lies between them. You could wrap this in a method like "GetStringInBetween(string content, string startText, string endText)" so that it is reusable.

Edit: an example of this turned into a method:

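/// <summary>
/// Returns the text located between the start and end text within content,
/// or null if either marker is not found.
/// </summary>
public static string GetStringInBetween(string content, string start, string end)
{
    var startIndex = content.IndexOf(start);
    if (startIndex < 0) return null;
    startIndex += start.Length;
    var endIndex = content.IndexOf(end, startIndex);
    return endIndex < 0 ? null : content.Substring(startIndex, endIndex - startIndex);
}

// The end marker matches the og:image tag from the question, which closes with "/>
string url = GetStringInBetween(documentHtml, "<meta property=\"og:image\" content=\"", "\"/>");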
caesay