I need to compare two Excel/PPT/CSV files that may have exactly the same content but were created at different points in time.

I want to compare only the file contents, in whatever manner possible, using any Node.js package.

But I couldn't figure out an easy way to do it: neither stream comparison nor buffer comparison helped.

I've done more research without much success, and I'm wondering how to ignore things such as timestamps and other metadata during the comparison, so that only the contents are matched.

I've tried stream-compare, stream-equal, file-compare, buff1.equals(buff2) and a few others, but none of them worked for my requirement.

I also didn't find any Node package on the web that does what I'm looking for.

Any insights or suggestions on how this can be achieved?

Thanks in advance; any help would be appreciated.

bhaskerchari
  • Search for a package that computes a hash of the document (for example sha256) and compare the hashes of the 2 documents. – fenixil Aug 20 '19 at 05:01
  • @Illia Popov, I'm not sure that really helps, because I believe hashing would take the file creation/modified dates into account as well as the content. I considered this approach but didn't try it, because of my past experience with hashing and the reason mentioned above. Anyway, I will give it a shot. Thanks for the response – bhaskerchari Aug 20 '19 at 05:06
  • If you are referring to filesystem metadata (file creation/update time), it is not stored in the content stream and you are fine to use hashing. If the metadata is stored in the file itself (company/author...), then I don't think there is an easy way to compare them. One thing that comes to mind is to convert the docs to a common format (print to PDF, for example) and compare the results. https://pandoc.org/ might be useful for this scenario. – fenixil Aug 20 '19 at 05:19

1 Answer

Search for a package that computes a hash of the document, for example crypto, calculate the hashes (sha256) of the 2 docs and compare them. If the hashes match, the document content will be the same (there is still a chance of a hash collision, depending on the hash algorithm you are using, but sha256 will give you decent confidence that the documents are identical). Check this thread for more details: Obtaining the hash of a file using the stream capabilities of crypto module (ie: without hash.update and hash.digest)
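As a rough illustration of the hashing approach, here is a minimal sketch using Node's built-in crypto and fs modules. The helper names hashFileSync and sameBytes are made up for this example, and the file is read synchronously in chunks for brevity; a stream-based variant like the one in the linked thread would produce the same digest.

```javascript
const crypto = require('crypto');
const fs = require('fs');

// Return the SHA-256 hex digest of a file, reading it in fixed-size chunks
// so that large files are not loaded into memory at once.
// (hashFileSync is a hypothetical helper name, not part of any package.)
function hashFileSync(filePath) {
  const hash = crypto.createHash('sha256');
  const chunk = Buffer.alloc(64 * 1024);
  const fd = fs.openSync(filePath, 'r');
  try {
    let bytesRead;
    // position = null reads from the current file position and advances it.
    while ((bytesRead = fs.readSync(fd, chunk, 0, chunk.length, null)) > 0) {
      hash.update(chunk.subarray(0, bytesRead));
    }
  } finally {
    fs.closeSync(fd);
  }
  return hash.digest('hex');
}

// True only when both files are byte-for-byte identical
// (up to the negligible chance of a SHA-256 collision).
function sameBytes(pathA, pathB) {
  return hashFileSync(pathA) === hashFileSync(pathB);
}
```

Note that this compares raw bytes only. Filesystem timestamps do not affect the digest, but anything stored inside the file does, so two files with the same visible content can still hash differently.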

fenixil
  • @Illia Popov, it doesn't help with the solution I'm looking for. I tried the crypto module as suggested, but it still gives a different hash for each file and the comparison fails. – bhaskerchari Aug 23 '19 at 13:05
  • Are the hashes different even if you copy the file? If the hashes are different then the content is different, so how do you consider the files identical? It would be very helpful if you attached the 2 files to your question so that the community could better understand the problem. – fenixil Aug 23 '19 at 13:14
  • I think I figured out the solution myself, using custom logic of my own and another node package called dir-compare, which covers the use cases for pptx and excel. For pdf I am taking a slightly different custom approach and using dir-compare again. I will try to post the solution once I'm fully sure of the fix. Thanks for the support so far – bhaskerchari Aug 26 '19 at 17:08
  • I checked the package and I'm happy that it helps you. Unfortunately I don't quite understand the problem you are trying to solve, so I have no idea why dir-compare works and hashes don't. Could you please provide the files or repro instructions in your post? – fenixil Aug 27 '19 at 02:44
  • Hi @Fenixil, the scenario is quite simple to reproduce: create two excel files, add the same contents to one or more sheets in each, and hash both files. The hashes would need to be the same to confirm that the excel files have the same contents, but they are not. In my case the excel files have more sheets and also have embedded images, and the hashes are different when I compare them even though the contents are the same. Typically, in my case, the 2 files are created at different points in time. – bhaskerchari Aug 28 '19 at 05:43
  • Hi @Fenixil, when you copy the file and then compare the hashes they might match, but as you may know that's not a practical use case. The point here is that the 2 excel files are generated separately, at different points in time, though with exactly the same contents. You can test this scenario by manually creating 2 separate excel files, putting the same contents in them, and hashing each file separately; you will notice that the hashes are not the same. – bhaskerchari Aug 28 '19 at 05:55
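One likely reason the whole-file hashes differ: .xlsx and .pptx files are ZIP packages, and both the ZIP entry headers and the docProps/core.xml part carry creation/modification timestamps. The sketch below is one possible way to compare such packages part by part instead of byte by byte. It is not what dir-compare does internally, and the names readZipEntries and sameZipContents are hypothetical. It reads the CRC-32 and uncompressed size that the ZIP central directory stores for each part and ignores everything under docProps/. It assumes a plain single-disk, non-ZIP64 archive.

```javascript
const fs = require('fs');

// Map each entry name in a ZIP archive to "crc32:uncompressedSize", taken
// from the central directory. Assumption: single-disk, non-ZIP64 archive.
function readZipEntries(filePath) {
  const buf = fs.readFileSync(filePath);
  // Locate the End Of Central Directory record by scanning backwards.
  let eocd = -1;
  for (let i = buf.length - 22; i >= 0; i--) {
    if (buf.readUInt32LE(i) === 0x06054b50) { eocd = i; break; }
  }
  if (eocd < 0) throw new Error(`${filePath} is not a ZIP archive`);

  const count = buf.readUInt16LE(eocd + 10); // total number of entries
  let pos = buf.readUInt32LE(eocd + 16);     // offset of central directory
  const entries = new Map();
  for (let n = 0; n < count; n++) {
    if (buf.readUInt32LE(pos) !== 0x02014b50) {
      throw new Error('Corrupt central directory');
    }
    const crc = buf.readUInt32LE(pos + 16);  // CRC-32 of uncompressed data
    const size = buf.readUInt32LE(pos + 24); // uncompressed size
    const nameLen = buf.readUInt16LE(pos + 28);
    const extraLen = buf.readUInt16LE(pos + 30);
    const commentLen = buf.readUInt16LE(pos + 32);
    const name = buf.toString('utf8', pos + 46, pos + 46 + nameLen);
    entries.set(name, `${crc}:${size}`);
    pos += 46 + nameLen + extraLen + commentLen;
  }
  return entries;
}

// Compare two packages part by part, skipping parts matched by `ignore`
// (by default the docProps/ metadata parts).
function sameZipContents(pathA, pathB,
                         ignore = (name) => name.startsWith('docProps/')) {
  const a = readZipEntries(pathA);
  const b = readZipEntries(pathB);
  const names = new Set([...a.keys(), ...b.keys()]);
  for (const name of names) {
    if (ignore(name)) continue;
    if (a.get(name) !== b.get(name)) return false;
  }
  return true;
}
```

This only helps when the remaining parts really are byte-identical. Files produced by different tools, or saved with a different part layout, can still differ in their XML even when they look the same in Excel, in which case the cell values would have to be parsed and compared. CSV files are plain text, so a direct hash or buffer comparison should be enough for them.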