
I'm receiving misnamed documents with a .pdf extension, and I know for a fact that these are supposed to be at least readable as pdfs. Some of these are .p7m files from which I have to obtain the contained document: Apache Tika infers their mime-type as either application/x-dbf or application/pkcs7-signature, and a standard application for p7m processing can easily extract the contained pdf. These documents have to be subsequently processed automatically through an OCR.

I have tried to extract the content in two ways:

  • using BC's CMSSignedData (source: both a colleague and SO);
  • using `openssl smime -decrypt -verify -inform DER -in path/to/infile.p7m -noverify -out path/to/outfile.pdf` in the command line (source).

A pdf extracted with the former halts in the subsequent automatic processing (although the document is readable with a PDF viewer), while the same pdf extracted with the latter works perfectly fine (it's viewable, and the OCR works on it). Therefore I need to translate the command line call into Java.
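For reference, the most literal translation I can think of is to run the very same command from Java through `ProcessBuilder`. This is only a minimal sketch: it assumes `openssl` is installed and on the `PATH`, and the class and method names (`P7mOpensslExtractor`, `buildCommand`, `extract`) are made up for illustration.

```java
import java.io.IOException;
import java.nio.file.Path;
import java.util.List;

class P7mOpensslExtractor {

    // Builds the same argument list as the openssl command line quoted above.
    static List<String> buildCommand(Path in, Path out) {
        return List.of("openssl", "smime", "-decrypt", "-verify",
                "-inform", "DER", "-in", in.toString(),
                "-noverify", "-out", out.toString());
    }

    // Runs openssl as a child process and returns its exit code.
    // Assumption: the openssl binary is available on the PATH.
    static int extract(Path in, Path out) throws IOException, InterruptedException {
        Process process = new ProcessBuilder(buildCommand(in, out))
                .inheritIO() // forward openssl's stdout/stderr to this process
                .start();
        return process.waitFor();
    }

    public static void main(String[] args) throws Exception {
        if (args.length != 2) {
            System.err.println("usage: P7mOpensslExtractor <infile.p7m> <outfile.pdf>");
            System.exit(2);
        }
        System.exit(extract(Path.of(args[0]), Path.of(args[1])));
    }
}
```

This only wraps the external tool, so it still depends on openssl being present on the machine that runs the pipeline, which is why I would prefer a pure Java solution.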

I am not familiar with encryption and only understand it at the very surface level. From my bare-bones understanding it seems I should use CMSEnvelopedData, but the example code in the javadoc requires a private key, which I don't have. I have tried searching for "extract DER encrypted no private key" and the like, with no real result (at least not one I could understand).

aPonza
  • p7m files can carry signed and/or encrypted content. If your p7ms are merely signed, not encrypted, your approach with `CMSSignedData` should work. Unfortunately you provide neither your code for that approach (or do you use the code from that _question_ unchanged?) nor an example file before and after running through that code, so helping is difficult. – mkl Feb 16 '21 at 12:34
  • I unfortunately can't provide the files I'm running through the pipeline, otherwise I would've. Yes, the CMSSignedData code is the same as `removeP7MCodes` from the linked question (construct from bytes[] then write the signed content to a ByteArrayOutputStream). I was hoping someone could infer from the working openssl line what isn't working with respect to the other approach, or could ask me more questions about the data, pushing towards a solution. Both Apache Tika and pdfinfo return `false` when asked about the document encryption. Maybe a byte-by-byte diff between outputs could help? – aPonza Feb 16 '21 at 13:45
  • The byte-by-byte comparison seems to be the same, I'm very confused right now. I checked permissions on the files and they're the same. I'll run the pipeline again asap. Weird, might have to close this. – aPonza Feb 16 '21 at 15:15
  • *"The byte-by-byte comparison seems to be the same"* - :)) and I just want to say, yes, please provide the comparison... – mkl Feb 16 '21 at 15:23
  • Eh, I still can't: if I gave you the hex dump you could still see the whole pdf in the end, which I can't share. EDIT: to be clear, the diff is empty since the files are the same, and the comparison would mean sharing twice the same document, apparently. Do you know/can you explain the difference between the openssl line and the CMS implementation? – aPonza Feb 16 '21 at 15:28
  • *"to be clear, the diff is empty since the files are the same"* - because of that my ":))" smiley above... *"Do you know/can you explain the difference"* - as the output is the same, the issue does not depend on which program it's from. You have to look for other differences in your setup, like (probably inherited) permissions, differences between local and network file systems, filename differences, call differences, timing, whatever. – mkl Feb 16 '21 at 15:48
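The byte-by-byte comparison discussed in the comments can be reproduced in Java without sharing the documents. This is a minimal sketch, assuming Java 12 or later (for `Files.mismatch`); the class and method names (`ExtractedPdfDiff`, `firstMismatch`) are hypothetical.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

class ExtractedPdfDiff {

    // Returns -1 if both files have identical content; otherwise the offset
    // of the first differing byte (or the length of the shorter file when
    // one file is a prefix of the other).
    static long firstMismatch(Path a, Path b) throws IOException {
        return Files.mismatch(a, b);
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 2) {
            System.err.println("usage: ExtractedPdfDiff <file1> <file2>");
            System.exit(2);
        }
        long offset = firstMismatch(Path.of(args[0]), Path.of(args[1]));
        System.out.println(offset == -1
                ? "files are identical"
                : "first difference at byte offset " + offset);
    }
}
```

If this reports identical files for the two extraction methods, the extraction itself is not the cause, and the remaining differences have to be in the setup, as noted in the last comment.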
