I have some documents in MHTML format and some in PDF format. I want to know whether the content of an MHTML document is the same as the content of a PDF document. How can I compare them and find the differences?
-
see: http://stackoverflow.com/questions/968935/c-binary-file-compare looks similar – Kevin Burton Jul 21 '11 at 09:44
-
You want to compare the contents? This won't be possible without pretty complex parsers. – Rudi Visser Jul 21 '11 at 09:45
-
Are you saying you want to compare an MHTML file to a PDF file to check if the content is the same? Or that you want to compare two MHTML or two PDF files? – Richard Blewett Jul 21 '11 at 09:46
-
I want to compare an MHTML file to a PDF file. – Vishnu Jul 21 '11 at 10:15
1 Answer
3
You will need an MHTML parser as well as a PDF parser library. Then you traverse both documents in parallel and compare the contents. Note that this is definitely non-trivial to do, as you will have to build a mapping system between elements in the different file formats.
If you want to take into account that content can be written in different ways (e.g. tables vs. tabs) and still look exactly the same to the user, things get very complicated quickly.
My gut feeling from the way you are asking your questions is that this project is way larger and more complex than you are ready for.
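As a rough illustration of the parsing step only: an MHTML file is a MIME multipart archive, so the HTML side can be reduced to plain text with nothing but the Python standard library. This is a minimal sketch, not a full MHTML parser; the names `mhtml_to_text` and `_TextCollector` are made up for this example, and it assumes the archive contains at least one `text/html` part. The PDF side would need a third-party PDF library and is not shown here.

```python
import email
from email import policy
from html.parser import HTMLParser


class _TextCollector(HTMLParser):
    """Collects visible text, skipping the bodies of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.chunks.append(data)


def mhtml_to_text(raw: bytes) -> str:
    """Return the visible text of every text/html part in an MHTML archive."""
    message = email.message_from_bytes(raw, policy=policy.default)
    collector = _TextCollector()
    for part in message.walk():
        # Multipart containers, images, stylesheets etc. are ignored.
        if part.get_content_type() == "text/html":
            collector.feed(part.get_content())
    collector.close()
    # Chunks are joined with a space, so inline markup may split a word;
    # that is acceptable if whitespace is ignored in the later comparison.
    return " ".join(" ".join(collector.chunks).split())
```

Calling `mhtml_to_text(open(path, "rb").read())` gives one whitespace-normalized string per document. It says nothing about layout, tables or reading order, which is where the real difficulty described above lies.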

Anders Abel
-
He can parse both files to text and ignore spaces, newlines and tabs, i.e. compare ONLY the letters (ignoring case, and maybe allowing some rate of error: let's say 1 char in every 500 chars can be wrong and the documents still count as equal) – Mark Segal Jul 21 '11 at 09:48
-
@Quantic Programming: That would work for simple text documents, but as soon as you have text boxes (divs or similar in HTML) that are not part of the main text flow, you'll run into problems. – Anders Abel Jul 21 '11 at 09:50
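The letters-only comparison suggested in the comments could look roughly like the sketch below. It assumes both documents have already been extracted to plain text by some other means. The function names and the default `max_error_rate` of 1 in 500 are illustrative, and the mismatch count derived from `difflib.SequenceMatcher` is an approximation rather than a true edit distance. It does not address the reading-order problem raised in the last comment: if the two extractors emit out-of-flow text boxes in a different order, the texts will be reported as different.

```python
from difflib import SequenceMatcher


def letters_only(text: str) -> str:
    """Case-fold the text and drop every character that is not a letter."""
    return "".join(ch for ch in text.casefold() if ch.isalpha())


def roughly_equal(text_a: str, text_b: str, max_error_rate: float = 1 / 500) -> bool:
    """Compare two texts on letters only, tolerating a small mismatch rate."""
    a, b = letters_only(text_a), letters_only(text_b)
    longest = max(len(a), len(b))
    if longest == 0:
        return True  # neither text contains any letters
    # autojunk=False stops difflib from ignoring frequent characters.
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    matched = sum(block.size for block in matcher.get_matching_blocks())
    # Letters of the longer text that could not be aligned count as errors.
    return (longest - matched) / longest <= max_error_rate
```

With this, `roughly_equal("Hello,\tWorld!\n", "hello world")` treats the two strings as equal because only case, punctuation and whitespace differ. Note that `SequenceMatcher` can be slow on very long inputs, so large documents may need to be compared in chunks.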