We have an internal web application that accepts a file of varying formats from a user in order to import large amounts of data into our systems.
One of the more recent upgrades that we implemented was to add a way to detect if a file was previously uploaded, and if so, to present the user with a warning and an option to resubmit the file, or to cancel the upload.
To accomplish this, we're computing the MD5
of the uploaded file, and comparing that against a database table containing the previously uploaded file information to determine if it is a duplicate. If there was a match on the MD5
, the warning is displayed, otherwise it inserts the new file information in the table and carries on with the file processing.
The following is the C#
code used to generate the MD5
hash:
private static string GetHash(byte[] input)
{
using (MD5 md5 = MD5.Create())
{
byte[] data = md5.ComputeHash(input);
StringBuilder bob = new StringBuilder();
for (int i = 0; i < data.Length; i++)
bob.Append(data[i].ToString("x2").ToUpper());
return bob.ToString();
}
}
Everything is working fine and well... with one exception.
Users are allowed to upload .xlsx
files for this process, and unfortunately this file type also stores the metadata of the file within the file contents. (This can easily be seen by changing the extension of the .xlsx
file to a .zip
and extracting the contents [see below].)
Because of this, the MD5
hash of the .xlsx
files will change with each subsequent save, even if the contents of the file are identical (simply opening and saving the file with no modifications will refresh the metadata and lead to a different MD5
hash).
In this situation, a file with identical records, but created at different times or by different users will slip past the duplicate file detection, and be processed.
My question: is there a way to determine if the content of an .xlsx
file matches that of a previous file without storing the file content? In other words: is there a way to generate an MD5
hash of just the contents of an .xlsx
file?