2

What are the steps to verify the integrity of documents in these formats: doc, docx, docm, odt, rtf, pdf, odf, odp, xls, xlsx, xlsm, ppt, pptm?

Or at least some of them. This would usually happen when they are uploaded to a content repository.

I assume the InputStream is read correctly from the multipart HTTP request 99.99% of the time; otherwise an exception would be thrown and handled. But a user can upload a file that is already corrupted. Do I use third-party libraries to check for that? I didn't see anything like that in odftoolkit, itextpdf, pdfbox, Apache POI or Tika.

Danilo Piazzalunga
lisak
  • 1
    What kind of corruption are you looking for? Deliberate? Accidental? Single bytes corrupted? Files truncated? And is it enough to say "that file looks a bit iffy", or must you only accept files that, say, open without warning in Office 2003 build 12345 or Office 2008 for Mac build 4321? – Gagravarr Jul 25 '11 at 09:15
  • I was just wondering how to handle TikaException, because parsing is the point where you would probably catch this problem, but mostly you only learn what kind of problem occurred during parsing. What should be done in that case? I'm really responsible for the delivery of the document; it's not as if I would store files in a content repository. I have no prior experience with document processing. Could you give some numbers? Probabilities? Stats? – lisak Jul 25 '11 at 09:45

4 Answers

2

There are many kinds of "corrupt".

  • Some corruptions should be easy to detect. For instance a truncated ODF file will most likely fail when you attempt to open it because the ZIP reader can't read it.

  • Others will be literally impossible to detect. For instance a one character corruption in an RTF file will be undetectable, and so (I think) will most RTF file truncations.
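
As a rough illustration of the first case, the sketch below drains every entry of a ZIP-based document (odt, docx, xlsx and similar) using only `java.util.zip`. The class and method names are hypothetical and not taken from any of the libraries mentioned in the question. It assumes the upload fits in memory, and it only reads the entries sequentially, so damage confined to the central directory at the end of the file would not be noticed.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;

/** Hypothetical helper: checks that a ZIP-based document can be read entry by entry. */
final class ZipContainerCheck {

    /**
     * Returns true if at least one entry exists and every entry can be
     * decompressed to its end without an I/O or ZIP format error.
     */
    static boolean isReadableZip(byte[] data) {
        int entries = 0;
        byte[] buffer = new byte[8192];
        try (ZipInputStream zip = new ZipInputStream(new ByteArrayInputStream(data))) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                // Draining the entry forces decompression; ZipInputStream
                // compares the CRC-32 when it reaches the end of the entry.
                while (zip.read(buffer) != -1) {
                    // discard the bytes, only readability matters here
                }
                entries++;
            }
        } catch (IOException e) {
            // Covers ZipException (bad CRC, bad header) and EOFException (truncation).
            return false;
        }
        return entries > 0;
    }
}
```

A `true` result only says the container is intact; the XML inside may still be invalid for the format, and nothing comparable exists for plain-text formats such as RTF.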


I'd be surprised if you found a single (free) tool to do this job for all of those file types, even to the extent that it is technically possible. The current generation of open source libraries for reading and writing document formats tends to focus on one family of formats only. If you are serious about this, you probably need to use a commercial library.

Stephen C
0

For all of the file formats listed above there are third-party libraries that can open them. I don't know of a "verification only" one, but I think being able to open a file without exceptions is at least a basic check that it conforms to the specified format. One such (commercial) library is Aspose (not affiliated, just a happy customer).
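
A much weaker check than fully opening the file, but one that needs no third-party library, is to look at the leading bytes. The sketch below is an assumption about how that could look, with hypothetical class and method names; it relies on the well-known signatures of OLE2 files (doc, xls, ppt), ZIP containers (docx, docm, xlsx, xlsm, pptm, odt, odp), PDF and RTF. It only tells you which container the upload claims to be, not that the rest of the file is intact.

```java
import java.nio.charset.StandardCharsets;

/** Hypothetical helper: guesses the container type of an upload from its first bytes. */
final class ContainerSniffer {

    enum Container { OLE2, ZIP, PDF, RTF, UNKNOWN }

    // Well-known file signatures; a match does not prove the rest of the file is valid.
    private static final byte[] OLE2_MAGIC =
            {(byte) 0xD0, (byte) 0xCF, 0x11, (byte) 0xE0, (byte) 0xA1, (byte) 0xB1, 0x1A, (byte) 0xE1};
    private static final byte[] ZIP_MAGIC = {'P', 'K', 3, 4};
    private static final byte[] PDF_MAGIC = "%PDF-".getBytes(StandardCharsets.US_ASCII);
    private static final byte[] RTF_MAGIC = "{\\rtf".getBytes(StandardCharsets.US_ASCII);

    /** Classifies the upload by comparing its first bytes against each signature. */
    static Container sniff(byte[] head) {
        if (startsWith(head, OLE2_MAGIC)) return Container.OLE2;
        if (startsWith(head, ZIP_MAGIC)) return Container.ZIP;
        if (startsWith(head, PDF_MAGIC)) return Container.PDF;
        if (startsWith(head, RTF_MAGIC)) return Container.RTF;
        return Container.UNKNOWN;
    }

    private static boolean startsWith(byte[] data, byte[] prefix) {
        if (data == null || data.length < prefix.length) return false;
        for (int i = 0; i < prefix.length; i++) {
            if (data[i] != prefix[i]) return false;
        }
        return true;
    }
}
```

Comparing the sniffed container with the one implied by the file extension can reject obviously mislabelled uploads before a full parser is involved.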

Yahia
  • Mostly there is no "open"; you supply the InputStream and parse it or get a DOM model, which may fail for a variety of reasons even though the document is not corrupted. It's no fun :-) One thing's for sure, I won't pay $7497 for Aspose :-) That's way too many figures – lisak Jul 25 '11 at 00:31
0

You can do checksums/hashes (that is, a secure hash) of the file before uploading, then upload the checksum separately. If the subsequently downloaded file has the same checksum, it has not been changed (to a certain high probability, depending on the checksum/hash used) from the original.
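
A minimal sketch of that idea, assuming SHA-256 as the hash and a hex string as the separately uploaded checksum (both are choices made for this example, as are the class and method names):

```java
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

/** Hypothetical helper: compares an upload against a checksum computed before transfer. */
final class UploadChecksum {

    /** Returns the SHA-256 digest of the data as a lowercase hex string. */
    static String sha256Hex(byte[] data) {
        try {
            MessageDigest digest = MessageDigest.getInstance("SHA-256");
            byte[] hash = digest.digest(data);
            StringBuilder hex = new StringBuilder(hash.length * 2);
            for (byte b : hash) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // Every Java platform implementation is required to provide SHA-256.
            throw new IllegalStateException("SHA-256 not available", e);
        }
    }

    /** True if the received bytes hash to the checksum the client sent separately. */
    static boolean matches(byte[] received, String expectedHex) {
        return sha256Hex(received).equalsIgnoreCase(expectedHex);
    }
}
```

This only shows that the bytes did not change between the client and the repository; a file that was already damaged before it was hashed will still match.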

mpez0
  • I mentioned that the transport is not the problem, but that users might upload a file that is already corrupted. Maybe I shouldn't have used the word "integrity" – lisak Jul 25 '11 at 00:32
0

Take a look at the LibreOffice project, which already handles these file formats. Parts of it are written in Java, and you should be able to find and reuse its mechanisms for detecting corrupted files.

I think you can get the code from here:

http://www.libreoffice.org/get-involved/developers/

Jaime Hablutzel