Test for Duplicate CSV Files?

Asked Aug 05 '21 at 12:38

Active Aug 05 '21 at 12:38


I have an application that will process a ton of data arriving in our system as CSV files. Ideally, I would like the system to be hardened against data duplication by recognizing and tossing out CSV files that have already been submitted. In general I can rely on the filename being relatively unique, but that may not always be the case.

Is there a good technique for running a hash or creating a signature for CSV files that will be usable for de-duping?

I guess at the end of the day I could always compare byte by byte to other files we've already processed that are the exact same size in bytes. ;)

Ultimately my app will be in JavaScript, but this is really a language-agnostic question in my mind.
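To make the idea concrete, here is a minimal sketch of the kind of signature check described above. It assumes Node.js, assumes each file fits in memory as a `Buffer`, and uses SHA-256 from the built-in `crypto` module as one possible hash. The names `fileSignature` and `SeenFiles` are hypothetical, made up for illustration.

```javascript
// Sketch only: assumes Node.js and that each CSV fits in memory as a Buffer.
const crypto = require("crypto");

// Hex SHA-256 digest of the raw file bytes (no CSV parsing involved).
function fileSignature(bytes) {
  return crypto.createHash("sha256").update(bytes).digest("hex");
}

// Hypothetical in-memory registry; a real system would persist the keys.
class SeenFiles {
  constructor() {
    this.signatures = new Set();
  }

  // Returns true if these exact bytes were seen before; records them otherwise.
  isDuplicate(bytes) {
    // Prefixing the byte length mirrors the "same size in bytes" pre-check.
    const key = `${bytes.length}:${fileSignature(bytes)}`;
    if (this.signatures.has(key)) return true;
    this.signatures.add(key);
    return false;
  }
}
```

Calling `isDuplicate` on the contents of each incoming file would flag a resubmission regardless of its filename. Because this hashes raw bytes, two files that differ only in line endings, trailing whitespace, or row order would count as different files; catching those would require normalizing the CSV before hashing.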

asked Aug 05 '21 at 12:38

Ken

The fact that the file is a CSV file doesn't seem to be relevant to me: if you have to compare all the corresponding fields of several files, why not compare simply the lines or even the whole file? – Pierre François Aug 05 '21 at 12:44
Here is a hash function in JavaScript: https://stackoverflow.com/questions/6122571/simple-non-secure-hash-function-for-javascript#8831937 – Pierre François Aug 05 '21 at 12:46


0 Answers