In C#, how can I parse out a URL from an HTML page that I have got using webproxy.load()?

Question

I am trying to read in the current day's Dilbert image. I am able to get the full text of the page by doing this:

    var todayDate = DateTime.Now.ToString("yyyy-MM-dd");
    var web = new HtmlWeb();
    web.UseCookies = true;
    var wp = new WebProxy("http://myproxy:8080");
    wp.UseDefaultCredentials = true;
    NetworkCredential nc = (NetworkCredential)CredentialCache.DefaultCredentials;
    HtmlDocument document = web.Load("http://www.dilbert.com/strips/comic/" + todayDate, "GET", wp, nc);

If I look at the full HTML of the document, I see the image URL listed multiple times on the page, such as:

<meta property="og:image" content="http://assets.amuniversal.com/c2168fa0c45a0132d8f0005056a9545d"/>

or:

<meta name="twitter:image" content="http://assets.amuniversal.com/c2168fa0c45a0132d8f0005056a9545d">

or:

  <img alt="Squirrel In The Large Hadron Collider - Dilbert by Scott Adams" class="img-responsive img-comic" height="280" src="http://assets.amuniversal.com/c2168fa0c45a0132d8f0005056a9545d" width="900" />

What is the best way to parse out the URL of this picture?

possible duplicate of [What is the best way to parse html in C#?](http://stackoverflow.com/questions/56107/what-is-the-best-way-to-parse-html-in-c) — Jesse Good, May 04 '15 at 23:34

score 1 · Accepted Answer · answered May 04 '15 at 23:29

You can try using HtmlAgilityPack or a similar library to parse the structure of the response HTML and then walk the DOM generated by the parser.
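
Since `web.Load(...)` in the question already returns an HtmlAgilityPack `HtmlDocument`, walking the DOM can come down to a short XPath query. The following is a minimal sketch, not tested against the live site. The class and method names (`ComicImageFinder`, `FindImageUrl`) are hypothetical, and the XPath expressions assume the tags look like the ones quoted in the question.

```csharp
using HtmlAgilityPack;

public static class ComicImageFinder
{
    // Returns the og:image URL, falling back to the src of the
    // <img class="... img-comic"> element, or null when neither is present.
    public static string FindImageUrl(HtmlDocument document)
    {
        // SelectSingleNode returns null when nothing matches the XPath.
        HtmlNode meta = document.DocumentNode.SelectSingleNode("//meta[@property='og:image']");
        if (meta != null)
        {
            return meta.GetAttributeValue("content", null);
        }

        HtmlNode img = document.DocumentNode.SelectSingleNode("//img[contains(@class, 'img-comic')]");
        return img != null ? img.GetAttributeValue("src", null) : null;
    }
}
```

With the `document` variable from the question, the call would be `string url = ComicImageFinder.FindImageUrl(document);`. Checking the result for null before using it covers days when the page layout differs from the sample tags.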

xxbbcc

caesay · Answer 2 · 2015-05-04T23:51:05.917

You can use HtmlAgilityPack if you are going to do lots of DOM manipulation, but a quick and dirty hack is to just use the built-in .NET string methods.

This is untested and written without an IDE, but you could try something like:

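    // documentHtml is assumed to hold the page's HTML as a string,
    // e.g. document.DocumentNode.OuterHtml for the HtmlDocument loaded in the question.
    var urlStartText = "<meta property=\"og:image\" content=\"";
    var urlEndText = "\"/>";
    var urlStartIndex = documentHtml.IndexOf(urlStartText) + urlStartText.Length;
    var url = documentHtml.Substring(urlStartIndex, documentHtml.IndexOf(urlEndText, urlStartIndex) - urlStartIndex);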

The idea is to find the start and end positions of the HTML text surrounding the URL and then use Substring to grab what lies between them. You could make a method like "GetStringInBetween(string content, string start, string end)" so that it would be reusable.

Edit: an example of this turned into a method:

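    /// <summary>
    /// Returns the text located between the start and end text within content,
    /// or null if either marker is not found.
    /// </summary>
    public static string GetStringInBetween(string content, string start, string end)
    {
        var markerIndex = content.IndexOf(start);
        if (markerIndex < 0) return null;

        var startIndex = markerIndex + start.Length;
        var endIndex = content.IndexOf(end, startIndex);
        if (endIndex < 0) return null;

        return content.Substring(startIndex, endIndex - startIndex);
    }

    // The end text must match how the tag closes; the og:image tag in the question ends with "/>
    string url = GetStringInBetween(documentHtml, "<meta property=\"og:image\" content=\"", "\"/>");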
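
The question shows two `<meta>` forms that close differently (`"/>` and `">`), so a string-search approach may be less brittle if it stops at the closing quote of the attribute instead of at the end of the tag. The following is a sketch of that variation using only `System`. The names `ImageUrlExtractor` and `FindImageUrl` are hypothetical, and the start markers assume the tags appear exactly as quoted in the question (same attribute order, spacing, and quoting).

```csharp
using System;

public static class ImageUrlExtractor
{
    // Start markers for the two <meta> forms quoted in the question,
    // tried in this order.
    private static readonly string[] StartMarkers =
    {
        "<meta property=\"og:image\" content=\"",
        "<meta name=\"twitter:image\" content=\""
    };

    // Returns the attribute value that follows the first marker found in html,
    // up to (not including) the next double quote. Returns null when no marker
    // is present or the value has no closing quote.
    public static string FindImageUrl(string html)
    {
        foreach (string marker in StartMarkers)
        {
            int markerIndex = html.IndexOf(marker, StringComparison.Ordinal);
            if (markerIndex < 0)
            {
                continue; // this tag form is not on the page
            }

            int valueStart = markerIndex + marker.Length;
            int valueEnd = html.IndexOf('"', valueStart);
            if (valueEnd >= 0)
            {
                return html.Substring(valueStart, valueEnd - valueStart);
            }
        }

        return null;
    }
}
```

Called as `ImageUrlExtractor.FindImageUrl(documentHtml)`, this returns null instead of throwing when the page contains neither tag, which keeps the caller's error handling simple.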
