
I have some documents in MHTML format and in PDF format. I want to know whether the content is the same in the MHTML and PDF versions. How can I compare them and find the differences?

J. Steen
Vishnu

1 Answer


You will need an MHTML parser as well as a PDF parser library. Then you traverse both documents in parallel and compare the contents. Note that this is definitely non-trivial to do, as you will have to build a mapping between elements in the two file formats.
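As a rough illustration of the MHTML half of that step: an MHTML file is a MIME multipart message, so Python's standard `email` and `html.parser` modules can pull out the visible text of the HTML part. This is only a minimal sketch, not a full parser. The names `TextCollector` and `mhtml_to_text` are made up for this example, it only looks at the first `text/html` part, and the PDF side is assumed to be handled separately by a PDF library of your choice.

```python
import email
from email import policy
from html.parser import HTMLParser


class TextCollector(HTMLParser):
    """Collects visible text, skipping <script> and <style> bodies."""

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is outside script/style elements.
        if not self._skip_depth and data.strip():
            self.chunks.append(data.strip())


def mhtml_to_text(raw_bytes):
    """Return the visible text of the first text/html part, or '' if none."""
    message = email.message_from_bytes(raw_bytes, policy=policy.default)
    for part in message.walk():
        if part.get_content_type() == "text/html":
            collector = TextCollector()
            collector.feed(part.get_content())  # decoded to str by the policy
            collector.close()
            return " ".join(collector.chunks)
    return ""
```

Usage would be along the lines of `mhtml_to_text(open("page.mht", "rb").read())`, where the file name is a placeholder. The result is flat text with no structure, so it does nothing to solve the element-mapping problem described above.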

If you want to take into account that content can be written in different ways (e.g. tables vs. tabs) and still look exactly the same to the user, things get very complicated quickly.

My gut feeling from the way you are asking your questions is that this project is way larger and more complex than you are ready for.

Anders Abel
  • He can parse both to text and ignore spaces, newlines and tabs, e.g. comparing ONLY letters (ignoring case, and maybe allowing some rate of error: let's say 1 char in every 500 chars can be wrong and still count as equal) – Mark Segal Jul 21 '11 at 09:48
  • @Quantic Programming: That would work for simple text documents, but as soon as you have text boxes (divs or whatever in HTML) that are not part of the main text flow, you'll run into problems. – Anders Abel Jul 21 '11 at 09:50
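
The letters-only comparison suggested in the first comment could be sketched roughly as follows. This assumes the text of both documents has already been extracted to plain strings. The names `letters_only` and `roughly_equal` are made up for this example, the 1-in-500 tolerance is taken from the comment, and `difflib.SequenceMatcher` is used as an approximation of the number of differing characters rather than an exact edit distance.

```python
import difflib


def letters_only(text):
    """Lower-case the text and drop everything that is not a letter."""
    return "".join(ch.lower() for ch in text if ch.isalpha())


def roughly_equal(text_a, text_b, max_error_rate=1 / 500):
    """True if the letters of both texts match within the given error rate."""
    a, b = letters_only(text_a), letters_only(text_b)
    if not a and not b:
        return True
    longest = max(len(a), len(b))
    matcher = difflib.SequenceMatcher(None, a, b, autojunk=False)
    matched = sum(block.size for block in matcher.get_matching_blocks())
    # Characters of the longer text that found no counterpart in the other.
    return (longest - matched) / longest <= max_error_rate


print(roughly_equal("Hello\tWorld", "hello world\n"))
print(roughly_equal("Hello World", "Hello Wirld"))
```

As the reply points out, this only helps when both extractions yield the text in the same order; text boxes outside the main flow can come out in different positions and would count as mismatches.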