I have an application that will process a ton of data arriving in our system as CSV files. Ideally, I would like the system to be hardened against data duplication by recognizing and tossing out CSV files that have already been submitted. In general I can rely on the filename being unique, but that may not always be the case.
Is there a good technique for hashing or otherwise creating a signature for CSV files that will be usable for de-duping?
I guess at the end of the day I could always compare byte by byte to other files we've already processed that are the exact same size in bytes. ;)
Ultimately my app will be in JavaScript, but this is really a language-agnostic question in my mind.
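To make the idea concrete, here is a rough sketch of the kind of signature I have in mind, assuming Node.js and its built-in `crypto` module. The names `csvSignature`, `isDuplicate`, and the in-memory `seen` set are hypothetical and only for illustration; a real system would presumably persist the signatures somewhere. It combines the byte length with a SHA-256 digest of the raw bytes, so it only catches byte-for-byte identical files.

```javascript
const crypto = require("crypto");

// Hypothetical helper: signature = byte length + SHA-256 hex digest of the raw bytes.
function csvSignature(bytes) {
  const digest = crypto.createHash("sha256").update(bytes).digest("hex");
  return `${bytes.length}:${digest}`;
}

// Hypothetical in-memory store of signatures already processed.
// Assumption: a real deployment would keep these in a database instead.
const seen = new Set();

// Returns true if identical bytes were submitted before; otherwise records them.
function isDuplicate(bytes) {
  const signature = csvSignature(bytes);
  if (seen.has(signature)) {
    return true;
  }
  seen.add(signature);
  return false;
}
```

Usage would be something like `isDuplicate(fs.readFileSync(path))` for each incoming file. Since this works on raw bytes, two files with the same rows but different line endings, column order, or encoding would count as different, which may or may not be what I want.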