Check whether a URL is a download link using WebClient (C#)

Question

I am reading URLs from the history database. For each URL, I download the content and store it in a string. I want to determine whether a link is a download link, i.e. one that points to an .exe or .zip file, for example. I assume I need to read the response headers to do this, but I don't know how to do that with WebClient. Any suggestions?

while (sqlite_datareader.Read())
{
    noIndex = false;

    string url = (string)sqlite_datareader["url"];

    try
    {
        // Skip PDFs, JPEGs, HTTPS links, and blacklisted URLs.
        if (url.Contains("http") && !url.Contains(".pdf") && !url.Contains(".jpg") && !url.Contains("https") && !isInBlackList(url))
        {
            using (WebClient client = new WebClient())
            {
                client.Headers.Add("user-agent", "Only a test!");

                string htmlCode = client.DownloadString(url);
            }
        }
    }
    catch (WebException)
    {
        // The request failed; move on to the next URL.
    }
}

score 2 · Answer 1 · answered May 10 '11 at 13:33

You're on the right track; you'll need to examine the ResponseHeaders after a successful request:

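var someType = "application/zip";
string contentType = client.ResponseHeaders["Content-Type"];
// The header may be absent, so check for null before calling Contains.
if (contentType != null && contentType.Contains(someType)) {
    // this was a "download link"
}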

The tricky part will be in determining what constitutes a download link since there are so many content types possible. For example, how would you decide whether XML data is a download link or not?
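
One way to make that decision explicit is a small helper that looks at both `Content-Disposition` and `Content-Type`. This is only a sketch: the class name, the method name, and the list of media types are illustrative assumptions rather than anything from the answer above, and the list would need tuning for real traffic.

```csharp
using System;
using System.Net;

static class DownloadHeuristics
{
    // Hypothetical list: media types treated as "downloads" for illustration only.
    static readonly string[] BinaryTypes =
    {
        "application/zip",
        "application/x-rar-compressed",
        "application/x-msdownload",
        "application/octet-stream"
    };

    // Returns true when the response headers suggest a file download.
    public static bool LooksLikeDownload(WebHeaderCollection headers)
    {
        if (headers == null) return false;

        // An "attachment" disposition asks the client to save the body as a file.
        string disposition = headers["Content-Disposition"];
        if (disposition != null &&
            disposition.TrimStart().StartsWith("attachment", StringComparison.OrdinalIgnoreCase))
        {
            return true;
        }

        string contentType = headers["Content-Type"];
        if (string.IsNullOrEmpty(contentType)) return false;

        // Drop parameters such as "; charset=utf-8" before comparing.
        string mediaType = contentType.Split(';')[0].Trim();
        foreach (string type in BinaryTypes)
        {
            if (mediaType.Equals(type, StringComparison.OrdinalIgnoreCase))
                return true;
        }
        return false;
    }
}
```

After a request completes, it could be called as `DownloadHeuristics.LooksLikeDownload(client.ResponseHeaders)`. Ambiguous types such as XML are deliberately not in the list, so they would be treated as ordinary pages.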

answered May 10 '11 at 13:33

Yuck

That's true. Perhaps there is a way to check the size of the data before downloading it? However, since I don't have much time, checking for .exe, .zip and .rar will suffice. Thank you – michelle May 10 '11 at 13:39
OK, but I will still need to download the string or get the response stream. The reason I want to filter out .exe files etc. is so that I won't need to download them. Unfortunately, not all links contain .exe in their URL, so I will need to look at the response headers :/ – michelle May 10 '11 at 13:55
You could try using `DownloadStringAsync()` instead. Then as soon as you have the headers you can determine what to do with the content and either cancel or allow the download to complete. – Yuck May 10 '11 at 14:17
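
A related way to look at the headers before reading the body is sketched below. Note that it swaps in `WebClient.OpenRead` rather than the `DownloadStringAsync()` suggested in the comment, because `OpenRead` returns a stream once the response headers have arrived. The class and method names are hypothetical, and treating only `text/*` responses as pages is an assumption made for illustration.

```csharp
using System;
using System.IO;
using System.Net;

static class HeaderPeek
{
    // Returns the page text, or null when the Content-Type is not text/*.
    public static string DownloadUnlessBinary(string url)
    {
        using (var client = new WebClient())
        {
            client.Headers.Add("user-agent", "Only a test!");

            // OpenRead sends the GET and returns once the headers are available.
            using (Stream body = client.OpenRead(url))
            {
                string contentType = client.ResponseHeaders?["Content-Type"] ?? "";

                // Assumption: anything that is not text/* is skipped as a download.
                if (!contentType.StartsWith("text/", StringComparison.OrdinalIgnoreCase))
                    return null; // the stream is disposed without reading the body

                // StreamReader assumes UTF-8 here; DownloadString may pick a different encoding.
                using (var reader = new StreamReader(body))
                {
                    return reader.ReadToEnd();
                }
            }
        }
    }
}
```

The server may already have sent part of the body by the time the stream is closed, so this reduces the amount downloaded rather than guaranteeing that nothing is transferred.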

score 2 · Accepted Answer · edited May 23 '17 at 12:26

Instead of loading the complete content behind the link, I would issue a HEAD request.

The HEAD method is identical to GET except that the server MUST NOT return a message-body in the response. The metainformation contained in the HTTP headers in response to a HEAD request SHOULD be identical to the information sent in response to a GET request. This method can be used for obtaining metainformation about the entity implied by the request without transferring the entity-body itself. This method is often used for testing hypertext links for validity, accessibility, and recent modification.

Quoted from http://www.w3.org/Protocols/rfc2616/rfc2616-sec9.html

See these questions for C# examples
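
For reference, a minimal sketch of a HEAD request using `HttpWebRequest`, the same generation of API as the `WebClient` in the question. The class and method names are hypothetical, and returning null on any failure is a simplification.

```csharp
using System;
using System.Net;

static class HeadProbe
{
    // Sends a HEAD request and returns the Content-Type header,
    // or null when the URL is not HTTP or the request fails.
    public static string GetContentType(string url)
    {
        var request = WebRequest.Create(url) as HttpWebRequest;
        if (request == null) return null; // not an http/https URL

        request.Method = "HEAD"; // ask for headers only, no body
        request.UserAgent = "Only a test!";

        try
        {
            using (var response = (HttpWebResponse)request.GetResponse())
            {
                return response.ContentType;
            }
        }
        catch (WebException)
        {
            // Covers timeouts, DNS failures, and 4xx/5xx status codes.
            return null;
        }
    }
}
```

Some servers reject HEAD or answer it with different headers than they would send for GET, so a null or unexpected result may still warrant falling back to a normal request.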

score 1 · Answer 3 · answered May 10 '11 at 13:33

Try checking WebClient's ResponseHeaders collection to validate the file type of the response.

answered May 10 '11 at 13:33

Pavel Morshenyuk

score 0 · Answer 4 · answered May 11 '11 at 11:12

In case anyone has the same problem: I used an attribute in the places.sqlite history database, which came in very handy!

places.sqlite contains a table called moz_historyvisits, which has a column visit_type. According to [1], a visit_type of 7 indicates a download link. Reading this value therefore determines whether a URL is a download link without reading the response headers or even sending a HEAD request.

[1] http://www.firefoxforensics.com/research/moz_historyvisits.shtml
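
A rough sketch of such a lookup follows. It assumes the URLs live in the `moz_places` table and that `moz_historyvisits.place_id` refers to `moz_places.id`; neither is stated in the answer above, so check them against your copy of the database. The class and method names are hypothetical, and the code is written against `IDbConnection` so it is not tied to a specific SQLite provider.

```csharp
using System.Collections.Generic;
using System.Data;

static class HistoryQueries
{
    // Assumed schema: moz_places(id, url) and moz_historyvisits(place_id, visit_type).
    const string DownloadVisitsSql =
        "SELECT DISTINCT p.url FROM moz_places p " +
        "JOIN moz_historyvisits v ON v.place_id = p.id " +
        "WHERE v.visit_type = 7";

    // Returns the URLs whose visits were recorded with visit_type 7 (download).
    // The connection must already be open.
    public static List<string> ReadDownloadUrls(IDbConnection openConnection)
    {
        var urls = new List<string>();
        using (IDbCommand command = openConnection.CreateCommand())
        {
            command.CommandText = DownloadVisitsSql;
            using (IDataReader reader = command.ExecuteReader())
            {
                while (reader.Read())
                {
                    // Skip rows without a URL instead of throwing.
                    if (!reader.IsDBNull(0))
                        urls.Add(reader.GetString(0));
                }
            }
        }
        return urls;
    }
}
```

The resulting list could then be used to skip those URLs in the download loop instead of probing each one over the network.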
