I have what I hope is an easy question. I'm trying to use iTextSharp to modify some PDF files, but it seems that the XMP metadata that iTextSharp puts at the end of the files is ruining the layout of the PDF files (and I'm not conversant enough with the PDF format to understand why).

Here's a small section of the original document

And the same section from the 'edited' document

You can see from the two images above that the document appears to have been rotated. Looking at a binary diff of the PDF files, however, the only difference appears to be some XMP metadata at the end of the files.

DIFF of files showing XMP metadata at end as only difference

I've tried opening the files in several PDF viewers (Sumatra PDF, Edge Browser and Adobe Acrobat) and all show the same weirdness.

I guess I have two questions:
a) How can the PDF file be so altered from just having XMP metadata at the end of the file?
b) How can I make iTextSharp not produce this output? (iTextSharp only seems to do this when I add/edit content, and not if I just strip out JavaScript or similar.)

<EDIT 1>
The code that I'm using for the iTextSharp is the PdfContentStreamEditor (verbatim) from the post here: https://stackoverflow.com/a/35915789/2535822
</EDIT 1>
<EDIT 2>
Ok.. it seems that it's not the XMP Metadata. I got rid of that by using:

pdfStamper.XmpMetadata = new byte[0];

However there is still a bunch of extra data placed at the end of the file

2 0 obj
<</Producer(PDFCreator 2.5.2.5233; modified using iTextSharp’ 5.5.13 ©2000-2018 iText Group NV \(AGPL-version\))/CreationDate(D:20171206173510+10'30')/ModDate(D:20180325144710+11'00')/Title(þÿ
endobj
404 0 obj
<</Length 0/Type/Metadata/Subtype/XML>>stream

endstream
endobj
405 0 obj
<</Length 3638/Filter/FlateDecode>>stream
xœÍZmÅ/6ÒZ2ÁÆ€
....

</EDIT 2>

BevanWeiss
  • Probably there is an issue with my `PdfContentStreamEditor` class. To verify I'd need the PDF in question, though. – mkl Mar 25 '18 at 07:21
  • I have another PDF that also seems to show 'weirdness' when put through the code. I can send this instead, since it doesn't contain any of our privileged corporate info. How best to send to you? I did have a look through the Adobe PDF spec, because I was surprised by the Write method putting a space / newline into the output (I was expecting a full 1:1 write through)... but it seemed valid (albeit, as noted, I don't know anything about the PDF format) – BevanWeiss Mar 25 '18 at 09:24
  • If there is no privileged info in it anymore, you can simply share the file by means of e.g. a public Google drive or drop box share. – mkl Mar 25 '18 at 09:55
  • Here's both an original, and one that has been through the PdfContentStreamEditor (without any editing supposed to have been performed). I only did the EditContent call on the first page, so the other pages are still healthy. https://drive.google.com/open?id=1KSXgoPgkUX9atCPQXDcx86T30xLBgJYJ – BevanWeiss Mar 25 '18 at 09:59
  • The file you shared appears to have a feature that covers pages with a note under some circumstances. Such features can be quite sensitive to document changes. I'll try to understand that feature better sometime in the next few days. – mkl Mar 25 '18 at 14:19
  • By the way, you should change the title of your question as that obviously is not anymore what you are trying to do. – mkl Mar 25 '18 at 20:05
  • Ok, I can reproduce the scrambling of the text on the first page... in contrast to the original code, though, I had to use append mode for that. As far as I can see now, the cause has to do with the password protection of the document (it is encrypted using the default password, so one does not have to enter a password but it is encrypted nonetheless which is why Adobe Reader shows "(SECURED)" thereafter). I'll look into that. – mkl Mar 25 '18 at 20:31
  • I created an answer for the issue with this example file. The problem rotating the contents surely is a different matter, though. If possible, also share that file, please. – mkl Mar 25 '18 at 21:17
  • I've added a set of revised files to the same google drive share as before, they are generated from PHA-Pro, and Cute PDF Writer... I suspect that it's an issue with page rotation as the entire document is landscape, whilst the resultant page seems to have the content rotated to be portrait (but still on a landscape document layout). – BevanWeiss Mar 26 '18 at 00:22
  • I added a section to my answer which explains the rotation and also how to prevent it. – mkl Mar 26 '18 at 11:25

2 Answers

I can answer your second question. The metadata you are trying to remove is not supposed to be removed. The DLL of the AGPL version that you are using will add that metadata no matter what you do in code. You will not be able to remove it with iText, as doing so is a direct violation of their license terms. Please refer to: https://itextpdf.com/AGPL

You must prominently mention iText and include the iText copyright and AGPL license in output file metadata.

Joris Schellekens
Hakan Usakli
  • OK, is it possible that this iText / AGPL metadata is causing the display issues that I'm seeing? It does seem to end with a bunch of stuff like standard PDF elements... <>/XObject<>/Font<>>>/Contents[405 0 R]>> endobj xref ..... trailer <<994a08ca7d4b4a1737acb7b2820c620c>]/Prev 2532558>> %iText-5.5.13 startxref 2545243 %%EOF – BevanWeiss Mar 25 '18 at 04:18
  • Metadata **never** changes the visual representation of a PDF. However, I read that (1) you don't know ISO 32000 that well, but (2) you are editing content streams. That is a contradiction. That's like saying to the PDF: I'm not a surgeon nor a brain specialist, but I'm going to do brain surgery on you. If you want to edit content streams, you need to know what you're doing. – Bruno Lowagie Mar 25 '18 at 07:05
  • @bruno Perhaps I'm wrong then... and there's more than just metadata involved. An original and damaged PDF can be found here https://drive.google.com/open?id=1KSXgoPgkUX9atCPQXDcx86T30xLBgJYJ – BevanWeiss Mar 25 '18 at 10:03
  • Because you use iText in an AGPL context, where can we see your entire code? Somewhere on GitHub maybe? I mean, your files are obviously proprietary, but your code isn't. (because AGPL) – Amedee Van Gasse Mar 25 '18 at 10:10
  • @Amadee-van-gasse, my entire code consists of one Windows Form, one CS file from mkl's stackoverflow post and the itextsharp nuget package. It has not been distributed or made available for use by anyone outside of me.. it also currently doesn't do anything beside spit out a wonky page 1. As you know from the AGPL, if it's not being made available for use by others or distributed then the modified work source code does not need to be released... However if I do get something working, then I will put it on github. If not, I will bin it, and do something else. – BevanWeiss Mar 25 '18 at 10:29
  • @AmedeeVanGasse since things are now somewhat working, you'll be pleased to know that I've put the source on github https://github.com/bevanweiss/PdfEditor Really not much to go on... still not sure what's causing the rotation on landscape pages, but it might be useful for some people... It seems to address what like 90% of posts are about, replacing text in a PDF. I realise it has huge limitations, but 'it works for me'. The auto-redacting feature is already coming in handy for me also... – BevanWeiss Mar 26 '18 at 07:00
  • @BevanWeiss For the redaction part I'd propose using the `PdfCleanUp` classes from the iTextSharp Extra package as they (to a certain degree) do remove the redacted content. The iText 7 `pdfSweep` module is based thereupon. – mkl Mar 26 '18 at 11:34
  • @mkl thanks again :) Yeah, the redaction part could be done more robustly for sure. I guess the best way would be a combination, if it's a Tj element found, then convert it to a TJ, remove the redacted text string and put a shift in the direction of the missing text, then once the text is removed put the black bar overlay to indicate that it has been redacted. – BevanWeiss Mar 26 '18 at 11:48
  • @BevanWeiss Actually a generic solution requires quite a lot more. Do have a look at the `PdfCleanUp` stuff, it is not perfect but already does consider a lot of stuff. – mkl Mar 26 '18 at 11:56

You have indeed found a bug in the PdfContentStreamEditor I used in this answer. The other issue is not a bug; it requires one to know how to disable a special feature (or quirk, depending on the circumstances) of iText.

Rotation of the content

This part deals with the rotation of content in the sample document PHA-Pro 8 - File.pdf provided by the OP.

As you have already seen yourself, the rotation issue appears to be connected with the fact that the page rotation of the page in question is not 0.

Indeed, the iText PdfStamper has a feature which in case of rotated pages automatically rotates additions one applies to the OverContent or UnderContent. This feature can be quite handy if you want to add upright content to the page without having to apply rotation yourself to make it upright. In case of the PdfContentStreamEditor, though, all coordinates we receive from the existing content already have the applicable rotation factored in.

Thus, we need to disable this feature. One can do so using the PdfStamper property RotateContents:

using (PdfReader pdfReader = new PdfReader(source))
using (PdfStamper pdfStamper = new PdfStamper(pdfReader, new FileStream(dest, FileMode.Create, FileAccess.Write), (char)0, true))
{
    pdfStamper.RotateContents = false;
    PdfContentStreamEditor editor = new PdfContentStreamEditor();

    for (int i = 1; i <= pdfReader.NumberOfPages; i++)
    {
        editor.EditPage(pdfStamper, i);
    }
}
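
To see whether a given file is affected at all, one can first list the rotation of each page. The following is only a minimal sketch, assuming iTextSharp 5.x; the class name RotationCheck and the command-line argument are placeholders and not part of the original code:

```csharp
using System;
using iTextSharp.text.pdf;

class RotationCheck
{
    static void Main(string[] args)
    {
        // args[0] is assumed to be the path of the PDF to inspect (placeholder)
        using (PdfReader pdfReader = new PdfReader(args[0]))
        {
            for (int i = 1; i <= pdfReader.NumberOfPages; i++)
            {
                // GetPageRotation reports the page rotation in degrees
                int rotation = pdfReader.GetPageRotation(i);
                if (rotation != 0)
                {
                    Console.WriteLine("Page {0} is rotated by {1} degrees", i, rotation);
                }
            }
        }
    }
}
```

Pages reported here are the ones for which the automatic rotation of OverContent and UnderContent additions would kick in unless RotateContents is set to false.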

Scrambling of text

This part deals with the scrambling of text in the sample document AS62061-2006.pdf provided by the OP.

You have found a bug in the PdfContentStreamEditor. Its Write method contains this loop:

foreach (PdfObject pdfObject in operands)
{
    pdfObject.ToPdf(canvas.PdfWriter, canvas.InternalBuffer);
    canvas.InternalBuffer.Append(operands.Count > ++index ? (byte) ' ' : (byte) '\n');
}

It should instead be

foreach (PdfObject pdfObject in operands)
{
    // Pass null instead of the writer so that string operands are not encrypted individually
    pdfObject.ToPdf(null, canvas.InternalBuffer);
    canvas.InternalBuffer.Append(operands.Count > ++index ? (byte) ' ' : (byte) '\n');
}

If one passes the PdfWriter to the ToPdf method of a PdfString and the PdfWriter uses encryption, the string contents get encrypted. Here, however, the string is written into a content stream, and in that case the individual string must not be encrypted; instead, the stream as a whole is encrypted later. With the original code the string ends up encrypted twice, so it is still encrypted after a viewer has decrypted the stream.

This applies to the PDF provided by the OP because

  • the PDF is encrypted using the default password and
  • the OP edited using a PdfStamper in append mode which encrypts the additions using the same password as the original file.
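
Whether a particular file falls into this category can be checked before editing. This is a minimal sketch, assuming iTextSharp 5.x and that PdfReader.IsEncrypted() is available as in that version; the class name EncryptionCheck and the path argument are placeholders:

```csharp
using System;
using iTextSharp.text.pdf;

class EncryptionCheck
{
    static void Main(string[] args)
    {
        // args[0] is assumed to be the path of the PDF to inspect (placeholder).
        // No password is passed here, so this only opens files that are either
        // unencrypted or encrypted with the default (empty) user password.
        using (PdfReader pdfReader = new PdfReader(args[0]))
        {
            Console.WriteLine("Encrypted: {0}", pdfReader.IsEncrypted());
        }
    }
}
```

If this prints True for a file that opened without a password, a PdfStamper in append mode will encrypt its additions as well, which is the situation in which the bug shows.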

With the original code, the result looks like this:

broken page content

With the fixed code, it looks like this:

proper page content

mkl