Please excuse this lengthy question. I've written a C# app that uses WebClient.DownloadFileAsync to download a file and save it to a client computer.

This works successfully for a pdf file, whose Internet folder location doesn't change. However, I'm also trying to download some audio files with a .mp3.zip extension.

If I input the URL for these files, I'm taken directly to the file download site where I'm presented with a dialog to either select individual files or click a link to "Download All Files".

I want to programmatically download the entire .mp3.zip file.

The problem with the "Download All Files" link is that its URL appears to include a randomly named folder. For example, in http://download.site.org/files/audio_books/xx/zipfile.mp3.zip, the xx is a folder name that changes.

If the URL for the audio files always had the same exact location, I could use WebClient.DownloadFileAsync without a problem. I'm able to manually read the Outer HTML if I inspect the element for the link, but I've observed that this (xx) changes monthly.

If I could find a way to successfully parse the URL in the Download link, I could verify what the current (xx) folder name is and then use WebClient normally.
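Once the link's href is in hand, pulling out the changing folder is the easy part. A minimal sketch, assuming the URL keeps the shape of the example above, with the (xx) folder directly above the file name; the class and method names (DownloadUrlParts, GetParentFolder) are hypothetical:

```csharp
using System;

public static class DownloadUrlParts
{
    // Returns the name of the folder that directly contains the file,
    // or null when the path has no folder above the file.
    public static string GetParentFolder(string url)
    {
        var uri = new Uri(url);

        // For the example URL the segments are:
        // "/", "files/", "audio_books/", "xx/", "zipfile.mp3.zip"
        string[] segments = uri.Segments;
        if (segments.Length < 3)
        {
            return null;
        }

        // The second-to-last segment is the folder; drop its trailing slash.
        return segments[segments.Length - 2].TrimEnd('/');
    }
}
```

For the example URL, GetParentFolder returns "xx". In practice the full href can be passed straight to WebClient.DownloadFileAsync, so this step only matters if the folder name itself needs checking.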

I've been all over the Internet and read through numerous Stack Overflow questions, for example "Grabbing just the URL of an href using HTMLAgilityPack" and "Image scraper with C#", but none of the suggestions appear to return the (xx) folder name contained in the Outer HTML.


I came across another post on Stack Overflow, "Parse inner HTML", which appears to be the closest answer to my question.

This is what I've tried, but it throws a NullReferenceException.

HttpWebRequest req = (HttpWebRequest)HttpWebRequest.Create(url);
req.Method = "GET";
req.UserAgent = "Mozilla/5.0 (Windows; U; MSIE 9.0; Windows NT 9.0; en-US))";
string source;
using (StreamReader reader = new StreamReader(req.GetResponse().GetResponseStream()))
{
    source = reader.ReadToEnd();
}
HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
doc.LoadHtml(source);
string hrefValue = doc.DocumentNode
    .Descendants("div")
    .Where(x => x.Attributes["class"].Value == "flRight")
    .Select(x => x.Element("a").Attributes["href"].Value)
    .FirstOrDefault();

Can anyone suggest why the Where clause that reads the class attribute's Value throws the exception, or what I need to change? I feel I'm really close to solving this issue, because if I inspect the element of the download button, I can see what I need in a div class.
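One likely cause, offered as a sketch rather than a confirmed diagnosis: in HtmlAgilityPack, Attributes["class"] returns null for any div that has no class attribute, so reading .Value throws on the first such div. Element("a") likewise returns null when a matching div has no direct a child. The version below uses GetAttributeValue with a default value instead. The names DownloadLinkFinder and FindDownloadHref are hypothetical, and the sketch assumes the link is present in the HTML the server returns; if the page builds it with JavaScript, no query over the downloaded source will find it.

```csharp
using System;
using System.Linq;
using HtmlAgilityPack;

public static class DownloadLinkFinder
{
    // Returns the href of the first link inside a div whose class list
    // contains "flRight", or null when no such link exists.
    public static string FindDownloadHref(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        return doc.DocumentNode
            .Descendants("div")
            // GetAttributeValue returns the default ("") when the attribute
            // is missing, so divs without a class no longer throw.
            .Where(div => div.GetAttributeValue("class", "")
                .Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries)
                .Contains("flRight"))
            // Descendants also finds links nested deeper than a direct child.
            .SelectMany(div => div.Descendants("a"))
            .Select(a => a.GetAttributeValue("href", ""))
            .FirstOrDefault(href => href != "");
    }
}
```

Splitting the class value on spaces is a design choice: it still matches when the div carries more than one class, which a plain equality check against "flRight" would miss.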

P.S. Is editing my original post the only way to ask additional questions, other than the limited comment box?

CodeMann
  • If you know that the root path remains the same and the file extension will always look something like .mp3.zip or .zip, you can try using either regex or substring methods to get the changing folder name. Please clarify if I am missing something here? – Krishna Veeramachaneni Oct 10 '14 at 22:35
  • A URL example would help. – aybe Oct 10 '14 at 22:35
  • What is the problem with reading the page monthly and parsing the HTML from it? A simple regex should generally work. – Kux Oct 10 '14 at 22:37
  • The audio file is named bookType_102014.mp3.zip; there are 3 different types, each with its own URL. That URL only gets me to the page where I manually select individual files or click the "Download All Files" link. The "Download All Files" link has the folder structure http://download.site.org/files/audio_books/xx/bookType_102014.mp3.zip in its hidden Outer HTML. Unless someone can provide a code example showing how I can programmatically read the URL in the Outer HTML behind the "Download All Files" link, I'm still stuck with the randomly changing xx portion of the folder path in that link. That is where the real problem lies. – CodeMann Oct 11 '14 at 21:17
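The regex route suggested in the comments could look roughly like this. It is a sketch under stated assumptions: the root path http://download.site.org/files/audio_books/ and the .mp3.zip suffix are taken from the example in the question, the full URL is assumed to appear literally in the downloaded page source, and the names ZipLinkRegex and TryFindZipUrl are hypothetical.

```csharp
using System;
using System.Text.RegularExpressions;

public static class ZipLinkRegex
{
    // Fixed root path, then one folder segment, then a file name ending in
    // .mp3.zip. Quotes, slashes and whitespace end a segment.
    private static readonly Regex ZipUrl = new Regex(
        @"http://download\.site\.org/files/audio_books/(?<folder>[^/""'\s]+)/(?<file>[^/""'\s]+\.mp3\.zip)",
        RegexOptions.IgnoreCase);

    // Searches raw page HTML for the first matching URL.
    // On success, url holds the full link and folder holds the changing part.
    public static bool TryFindZipUrl(string html, out string url, out string folder)
    {
        Match m = ZipUrl.Match(html ?? "");
        url = m.Success ? m.Value : null;
        folder = m.Success ? m.Groups["folder"].Value : null;
        return m.Success;
    }
}
```

When TryFindZipUrl returns true, url can be handed to WebClient.DownloadFileAsync as-is. A false result on a page that visibly has the link in the browser would suggest the link is added by script after the page loads.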

0 Answers