6

Here is my scenario: I have a Windows Store app, a local file, and a link to a file on the internet. Is there a way to check whether these two files are the same WITHOUT downloading the file from the link?

The code used to get the file is this:

private static async void SetImage(PlaylistItem song, string source, string imageName)
{

    HttpClient client = new HttpClient();

    HttpResponseMessage message = await client.GetAsync(source);

    StorageFolder myfolder = Windows.Storage.ApplicationData.Current.LocalFolder;
    StorageFile sampleFile = await myfolder.CreateFileAsync(imageName, CreationCollisionOption.ReplaceExisting);
    byte[] byteArrayFile = await message.Content.ReadAsByteArrayAsync();

    await FileIO.WriteBytesAsync(sampleFile, byteArrayFile);

    song.Image = new BitmapImage(new Uri(sampleFile.Path));

}
Panagiotis Kanavos
Mario Stoilov
  • What storage service are you using? Most services use hashes for concurrency purposes but the way you retrieve them can vary – Panagiotis Kanavos Aug 08 '13 at 11:34
  • the files in question are youtube video thumbnails – Mario Stoilov Aug 08 '13 at 11:38
  • Duplicate of [Best way to tell if two files are the same?](http://stackoverflow.com/questions/714574/best-way-to-tell-if-two-files-are-the-same). Also, your question is a oneliner that does not show you understand the problem (have you researched any ways of comparing files and why didn't they suffice?) or that you have tried anything. – CodeCaster Aug 08 '13 at 11:49
  • @CodeCaster the question is valid if not perfectly phrased. The post you point to does not apply for web-hosted files. This is more of an HTTP question – Panagiotis Kanavos Aug 08 '13 at 11:57
  • @PanagiotisKanavos while the mere question itself is valid, its current format shows no research whatsoever of OP himself, which is required on SO. If OP actually explained his actual problem (**why** compare YouTube thumbnails) and explained what has been tried, it would be a better question. – CodeCaster Aug 08 '13 at 11:59
  • @CodeCaster please check the discussion. You'll see it's not what you thought. Anyone who has already worked on syncing web files has encountered this problem, even if he doesn't know how to phrase it. – Panagiotis Kanavos Aug 08 '13 at 12:01
  • @PanagiotisKanavos your reply does not address the concerns I mentioned in my comment. I perfectly understand OP's problem, but **op shows no research**. I close-voted with that duplicate so OP had some more to read on the subject. Otherwise I'd have close-voted as _"Offtopic - no understanding of the problem"_. A question that showed a clear step-by-step description of the problem would've immediately triggered me to write a reply that explained OP the use of the ETag and saving that along with the image when it was first downloaded - **if** the local version is downloaded. – CodeCaster Aug 08 '13 at 12:02

5 Answers

7

The usual solution is to keep a hash of the cloud file somewhere, usually in the file's metadata, and compare it with the hash of your local file. Short checksums such as CRC32 are less suitable for this because their small output makes collisions (i.e. different files having the same checksum) far more likely than with a cryptographic hash.

Many storage services (Azure Blob storage, Amazon S3, CloudFiles) expose a file's MD5 or SHA hash, often as its ETag, the value used to detect changes to a file for caching and concurrency purposes. Typically, a HEAD request on the file will return its headers, including the ETag value, without the body.

If you have the option of picking your own algorithm, choose SHA-256 or stronger. MD5 is often somewhat faster to compute, but it is no longer collision-resistant, and for small files such as thumbnails the hashing cost of either algorithm is negligible compared with the network transfer.
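
As a rough illustration of the hash comparison described above, here is a minimal sketch that hashes a local file with SHA-256 and compares it with a hash published for the remote file. The class and method names (`FileHasher`, `MatchesRemoteHash`) are hypothetical, and it assumes the desktop .NET `System.Security.Cryptography` API; a Windows Store app would probably need the WinRT `HashAlgorithmProvider` instead.

```csharp
using System;
using System.IO;
using System.Security.Cryptography;

public static class FileHasher
{
    // Returns the SHA-256 digest of the given bytes as lowercase hex.
    public static string Sha256Hex(byte[] data)
    {
        using (var sha = SHA256.Create())
        {
            return ToHex(sha.ComputeHash(data));
        }
    }

    // Hashes a file by streaming it, so large files are not loaded into memory at once.
    public static string Sha256HexOfFile(string path)
    {
        using (var sha = SHA256.Create())
        using (var stream = File.OpenRead(path))
        {
            return ToHex(sha.ComputeHash(stream));
        }
    }

    // True when the local file's hash equals the hash published for the remote file.
    // remoteHashHex is assumed to be a hex-encoded SHA-256 digest obtained from the server.
    public static bool MatchesRemoteHash(string localPath, string remoteHashHex)
    {
        return string.Equals(Sha256HexOfFile(localPath), remoteHashHex,
                             StringComparison.OrdinalIgnoreCase);
    }

    private static string ToHex(byte[] digest)
    {
        return BitConverter.ToString(digest).Replace("-", "").ToLowerInvariant();
    }
}
```

This only helps if the server actually publishes a hash computed with the same algorithm; otherwise there is nothing to compare against.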

What storage service are you using?

EDIT

If you only want to check files to avoid downloading them again, you can use the ETag directly. ETag was created for exactly this purpose. You just have to store it together with your file when you download it the first time. That's how proxies and caches know to send you a cached version of a picture instead of hitting the destination server.

In fact, you can probably just do a GET on the file and send the stored ETag in an If-None-Match header. The intermediate proxies and the final web server will return a 304 (Not Modified) status code if the destination file hasn't changed. This will halve the number of requests you need to download all images in your list.

An alternative is to store the Last-Modified header value for the file and send it in the If-Modified-Since header of the GET request.
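
A conditional GET along these lines could look roughly like the sketch below. The class and method names are hypothetical, and it assumes the ETag (including its quotes) and/or the Last-Modified value were saved when the file was first downloaded.

```csharp
using System;
using System.Net;
using System.Net.Http;
using System.Net.Http.Headers;
using System.Threading.Tasks;

public static class ConditionalDownload
{
    // Builds a GET request carrying whichever validators were stored from the previous download.
    public static HttpRequestMessage BuildRequest(string url, string storedETag,
                                                  DateTimeOffset? storedLastModified)
    {
        var request = new HttpRequestMessage(HttpMethod.Get, url);
        if (!string.IsNullOrEmpty(storedETag))
        {
            // The stored value must keep its quotes, e.g. "\"abc123\"".
            request.Headers.IfNoneMatch.Add(EntityTagHeaderValue.Parse(storedETag));
        }
        if (storedLastModified.HasValue)
        {
            request.Headers.IfModifiedSince = storedLastModified;
        }
        return request;
    }

    // Returns null when the server answers 304 Not Modified, otherwise the fresh bytes.
    public static async Task<byte[]> DownloadIfChangedAsync(HttpClient client, string url,
        string storedETag, DateTimeOffset? storedLastModified)
    {
        using (var request = BuildRequest(url, storedETag, storedLastModified))
        using (var response = await client.SendAsync(request))
        {
            if (response.StatusCode == HttpStatusCode.NotModified)
            {
                return null; // keep using the local copy
            }
            response.EnsureSuccessStatusCode();
            return await response.Content.ReadAsByteArrayAsync();
        }
    }
}
```

A caller would treat a `null` result as "load the image from local storage" and anything else as new content to save, together with the validators from the new response.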

EDIT 2

You mention that the ETag header is null, although your code doesn't show how you retrieve it.

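HttpResponseMessage has multiple Headers properties, both on the message itself and on its Content. You need to use the proper property to retrieve each value: the ETag is exposed through response.Headers, while Last-Modified is exposed through response.Content.Headers.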

You can also check using Fiddler to ensure the server does actually return an ETag.
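
To make the two header locations concrete, here is a small sketch that issues a HEAD request and reads both validators. The class and helper names are hypothetical; the properties used (`Headers.ETag` and `Content.Headers.LastModified`) are the standard `System.Net.Http` ones.

```csharp
using System;
using System.Net.Http;
using System.Threading.Tasks;

public static class ResponseValidators
{
    // ETag is a response header, so it lives on response.Headers.
    // Returns null when the server did not send one.
    public static string GetETag(HttpResponseMessage response)
    {
        return response.Headers.ETag != null ? response.Headers.ETag.ToString() : null;
    }

    // Last-Modified is a content header, so it lives on response.Content.Headers.
    public static DateTimeOffset? GetLastModified(HttpResponseMessage response)
    {
        return response.Content != null ? response.Content.Headers.LastModified : null;
    }

    // Sends a HEAD request so only the headers are transferred, not the file body.
    public static async Task<HttpResponseMessage> HeadAsync(HttpClient client, string url)
    {
        var request = new HttpRequestMessage(HttpMethod.Head, url);
        return await client.SendAsync(request);
    }
}
```

If both helpers return null for a given URL, the server is most likely not sending these headers at all, which is what a Fiddler trace would confirm.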

EDIT 3

Finally found a way to get an ETag from YouTube! The answer comes from "How to get thumbnail of YouTube video link using YouTube API?"

Doing a HEAD or GET on a YouTube thumbnail from ytimg.com does NOT return the ETag or Last-Modified headers.

Using YouTube's Data API and doing a GET on gdata.youtube.com, on the other hand, returns a wealth of information about the video. An ETag value is included, although I suspect it changes whenever the video changes. This may be OK though, if you only want to download an image when the video changes, or you don't want to download the image a second time.

The code I used was:

var url = "http://gdata.youtube.com/feeds/api/videos/npvJ9FTgZbM?v=2&prettyprint=true&alt=json";

using(var  client = new HttpClient())
{
    var response = await client.GetAsync(url);
    var etag1 = response.Headers.ETag;
    var content = await response.Content.ReadAsStringAsync();
    ...
}
Panagiotis Kanavos
  • none, The idea is the user is browsing a list of images on the internet (the list is not mine, nor the place where they are held is mine) and I want to limit bandwidth usage, i.e if the user already has this image, don't download it, just load it from the local storage. The images in question are youtube video thumbnails. – Mario Stoilov Aug 08 '13 at 11:36
  • This may be an easier case. You can use HTTP GET with the If-XXX headers to get a file only if it has changed – Panagiotis Kanavos Aug 08 '13 at 11:53
  • This may be a problem with the server or your code. Please post the code. What about Last-Modified? What URL are you hitting? Is it publicly available? – Panagiotis Kanavos Aug 08 '13 at 12:04
1

You could calculate a hash of the file contents, like Git does, using MD5 or similar. Then you only need to check whether the files have the same hash.

David Elliman
1

If you want to do a comparison without downloading, and you are the one who placed the file on the internet, then ideally you should publish a checksum alongside the uploaded file. Before uploading a new version, you can compare the checksum of the local file with the one on the server. If they differ, proceed with the upload; otherwise cancel it.

Ehsan
0

Directly? No. If the online file is also provided with a hash, though, you can compare it with the hash of the local file and determine with high probability whether the files are equal.

ZombieSheep
0

Now with your update, it's kind of clear what your code does: it downloads an image from a given URL and stores it in your application data folder under the given filename. You want to download any image only once.

It's still unclear to me how you call this code, but to me the solution looks like you just need a "URL to filename" translation. So, in pseudocode:

BitmapImage GetImage(string sourceURL)
{
    string filename = GetFilenameForURL(sourceURL);

    BitmapImage image;

    if (!FileExists(filename))
    {
        image = DownloadAndSaveImage(sourceURL, filename);
    }
    else
    {
        image = ReadImageFile(filename);
    }

    return image;
}
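
One possible shape for the `GetFilenameForURL` helper in the pseudocode above is to derive the filename from a hash of the URL, so the same URL always maps to the same local file. This is only a sketch under that assumption: the class name is hypothetical, it expects an absolute URL, and it uses the desktop .NET `SHA256` class rather than the WinRT cryptography API.

```csharp
using System;
using System.IO;
using System.Security.Cryptography;
using System.Text;

public static class ImageCacheNames
{
    // Maps an absolute URL to a stable, filesystem-safe filename.
    public static string GetFilenameForUrl(string sourceUrl)
    {
        using (var sha = SHA256.Create())
        {
            // Hash the URL text; the hex digest contains only characters valid in filenames.
            byte[] digest = sha.ComputeHash(Encoding.UTF8.GetBytes(sourceUrl));
            string name = BitConverter.ToString(digest).Replace("-", "").ToLowerInvariant();

            // Keep the original extension (if any) so the file is still recognizable as an image.
            string extension = Path.GetExtension(new Uri(sourceUrl).AbsolutePath);
            return name + extension;
        }
    }
}
```

Hashing the URL avoids problems with characters that are legal in URLs but not in filenames, at the cost of filenames that are not human-readable.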

This does not account for images that have been updated on the server. If you want to do that, you need to save metadata in the DownloadAndSaveImage() call, for example the mentioned ETag or last-modified date.

Then, to save bandwidth, you can do a HEAD or conditional GET request with an If-None-Match or If-Modified-Since header before the call to ReadImageFile() to check whether a newer version is available.

CodeCaster