I'm trying to edit some contents of a pdf using PDFBox in Java. The problem is, whenever I edit any string in the pdf, and try to open it using Adobe Reader, the last line does not appear in the newly rendered pdf.

When I open the rendered pdf directly in a browser, I can see the last line, but it is displayed in a different encoding. I'm using the following code to edit the contents of the pdf:

PDDocument doc = PDDocument.load(FileName);
PDPage page = (PDPage) doc.getDocumentCatalog().getAllPages().get(0);
PDStream contents = page.getContents();
PDFStreamParser parser = new PDFStreamParser(contents.getStream());
parser.parse();
List<Object> tokens = parser.getTokens();
for (int j = 0; j < tokens.size(); j++) {
    Object next = tokens.get(j);
    if (next instanceof PDFOperator) {
        PDFOperator op = (PDFOperator) next;
        if (op.getOperation().equals("Tj")) {
            COSString previous = (COSString) tokens.get(j - 1);
            String string = previous.getString();

            string = string.replace("@ordnum&", (null != data.getOrderNumber()?data.getOrderNumber():""));
            string = string.replace("@shipid&", (null != data.getShipmentId()?data.getShipmentId():""));
            string = string.replace("@customer&", (null != data.getCustomerNumber()?data.getCustomerNumber():""));
            string = string.replace("@fromname&", (null != data.getFromName()?data.getFromName():""));

            tokens.set(j - 1, new COSString(string.trim()));
        }
    }
}

Editing the pdf removes the line which says "Have questions? ...". What is the problem here? Am I doing something wrong?

Thanks.

drunkenfist
  • *Am I doing something wrong?* - yes: you assume that `COSString` contents are easily editable strings. In general they are not. They may have a custom single- or multi-byte encoding. In your case that bottom line is the only one encoded using a special encoding, and your editing destroys it. – mkl Mar 23 '15 at 23:55
  • But I'm not doing anything to that COSString. When I try to print the string, I do not see that line. So shouldn't it come as it is? If not, how can I process/handle that? I'm new to PDFBox, so I don't know much about it. – drunkenfist Mar 24 '15 at 01:27
  • *But I'm not doing anything to that COSString* - yes, you do. You transform it into a Java String (`previous.getString()`) which you trim and build a new COSString from (`new COSString(string.trim())`) which replaces the original. In case of certain encodings, this can destroy the string altogether. *If not, how can I process/handle that* - PDF contents are not meant to be edited like that at all, neither in PDFBox nor in other PDF libraries. Consider using PDF form fields. – mkl Mar 24 '15 at 05:44

1 Answer

Why that last line becomes invalid

First of all, you have to be aware that there are two fundamentally different situations for strings in PDF:

  • outside content streams, e.g. author and keywords for the document properties, and
  • inside content streams representing sequences of glyphs from some font to be drawn.

The former type is encoded using either PDFDocEncoding (akin to Latin1) or UTF-16BE with a leading byte-order marker. The method COSString.getString and the constructor COSString(String) are designed for this kind of string.

The latter type is encoded using the encoding defined for the PDF font this string is to be rendered with. This may be some standardized encoding like WinAnsiEncoding (akin to Latin1) or UniGB-UTF16-H (Unicode (UTF-16BE) encoding for the Adobe-GB1 character collection). But it may also be some custom single- or multi-byte encoding. Neither the standardized nor the custom multi-byte encodings have a byte-order marker.

In the page content stream in your PDF most strings use WinAnsiEncoding (because that is the encoding of their font). Because WinAnsiEncoding and PDFDocEncoding are very similar, the PDFDocEncoding COSString method and constructor you use work quite fine for them.

That last line, though, is encoded using Identity-H, which is the horizontal identity mapping for 2-byte CIDs, i.e. a two-byte encoding that directly references a character ID in the font program and has no meaning apart from that font program.

As this string does not start with a byte order mark, COSString.getString assumes it to use the single-byte encoding PDFDocEncoding and so creates two Java string characters for each original two-byte PDF string character. As the character values for some of these characters are outside the actually valid PDFDocEncoding range, the constructor COSString(String) creates a PDF string in which each of the intermediate Java characters is represented using one two-byte UTF-16BE character; furthermore a byte-order marker is added.

Thus, the original PDF string (in hexadecimal writing)

002b004400590048000300540058004800560057004c0052005100560022000300260052
005100570044004600570003005800560003004400570003004b0057005700530056001d
00120012005a005a005a005600110046004c0057005500580056004f0044005100480011
004600520050001200460052005100570044004600570010005800560012

after your edit becomes

FEFF002B0000004400000059000000480000000300000054000000580000004800000056
000000570000004C00000052000000510000005600000022000000030000002600000052
000000510000005700000044000000460000005700000003000000580000005600000003
0000004400000057000000030000004B00000057000000570000005300000056000002DB
00000012000000120000005A0000005A0000005A0000005600000011000000460000004C
000000570000005500000058000000560000004F00000044000000510000004800000011
000000460000005200000050000000120000004600000052000000510000005700000044
0000004600000057000000100000005800000056
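The transformation can be reproduced with the JDK alone. The following is a minimal sketch, not PDFBox's actual code: it assumes ISO-8859-1 as a stand-in for PDFDocEncoding (the two differ for a few code points, e.g. byte 0x1D, which shows up as 02DB in the dump above), and it simply forces the UTF-16BE output with byte-order marker that the COSString(String) constructor chose in this case. The class and helper names are made up for illustration. It also shows that trim() removes a leading 00 byte and any trailing bytes up to 0x20, which is why the edited string starts with 002B and has lost its final 0012.

```java
import java.nio.charset.StandardCharsets;

class RoundTripDemo {
    // Hypothetical helper: parse a hex string into raw bytes.
    static byte[] fromHex(String hex) {
        byte[] out = new byte[hex.length() / 2];
        for (int i = 0; i < out.length; i++) {
            out[i] = (byte) Integer.parseInt(hex.substring(2 * i, 2 * i + 2), 16);
        }
        return out;
    }

    // Hypothetical helper: format raw bytes as upper-case hex.
    static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    // Simulates getString() -> trim() -> new COSString(String) for a
    // two-byte encoded PDF string without a byte-order marker.
    static byte[] roundTrip(byte[] original) {
        // No byte-order marker, so every byte becomes one Java char
        // (assumption: ISO-8859-1 stands in for PDFDocEncoding).
        String s = new String(original, StandardCharsets.ISO_8859_1);
        // trim() strips leading and trailing chars <= U+0020, including U+0000.
        s = s.trim();
        // Re-encode as UTF-16BE and prepend the byte-order marker FEFF.
        byte[] body = s.getBytes(StandardCharsets.UTF_16BE);
        byte[] out = new byte[body.length + 2];
        out[0] = (byte) 0xFE;
        out[1] = (byte) 0xFF;
        System.arraycopy(body, 0, out, 2, body.length);
        return out;
    }

    public static void main(String[] args) {
        // First three two-byte codes of the original string above.
        System.out.println(toHex(roundTrip(fromHex("002B00440059"))));
        // prints FEFF002B0000004400000059
    }
}
```

The output matches the first 24 hex digits of the edited string shown above: each original two-byte code has been split into two characters, and each of those now occupies two bytes of its own.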

Depending on the PDF viewer, this may have different effects. Your original line

[image: original line]

may, for example, become spread very wide

[image: line spread across page]

or vanish completely

[image: line vanished]

In a nutshell, therefore, if you need to edit a PDF like that, make sure that you only edit PDF strings with a Latin1-like encoding.

If you also need to edit differently encoded PDF strings, extract them as byte[] using the COSString method getBytes, edit this array in a way applicable to the encoding in question, and create a new COSString from the edited bytes using the constructor COSString(byte[]).
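A byte-level replacement along these lines might look as follows. This is only a sketch under assumptions: replaceBytes is a hypothetical helper, not part of PDFBox, and ISO-8859-1 is used to encode the ASCII-only placeholder as a stand-in for WinAnsiEncoding. Strings that do not contain the placeholder bytes come back byte for byte unchanged, so an Identity-H string is no longer damaged.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class ByteLevelReplace {
    // Hypothetical helper: replace every occurrence of 'find' in 'src' with
    // 'repl' without ever converting the data to a Java String.
    static byte[] replaceBytes(byte[] src, byte[] find, byte[] repl) {
        if (find.length == 0) {
            return src.clone();
        }
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        int i = 0;
        while (i < src.length) {
            boolean match = i + find.length <= src.length
                    && Arrays.equals(Arrays.copyOfRange(src, i, i + find.length), find);
            if (match) {
                out.write(repl, 0, repl.length);
                i += find.length;
            } else {
                out.write(src[i]);
                i++;
            }
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        byte[] find = "@ordnum&".getBytes(StandardCharsets.ISO_8859_1);
        byte[] repl = "42".getBytes(StandardCharsets.ISO_8859_1);

        // Single-byte (Latin1-like) string: the placeholder is replaced.
        byte[] latin = "Order: @ordnum&".getBytes(StandardCharsets.ISO_8859_1);
        System.out.println(new String(replaceBytes(latin, find, repl), StandardCharsets.ISO_8859_1));

        // Two-byte CID string: no match, so the bytes stay untouched.
        byte[] cid = {0x00, 0x2B, 0x00, 0x44};
        System.out.println(Arrays.equals(replaceBytes(cid, find, repl), cid));

        // Inside the question's loop this would be used roughly like
        //   tokens.set(j - 1, new COSString(replaceBytes(previous.getBytes(), find, repl)));
    }
}
```

The replacement bytes still have to be valid in the encoding of the font in question; for a two-byte encoding a plain byte search could also match across character boundaries, so the search would need to step in units of the character size.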

But even that is not a good idea at all.

Problems with editing streams like that in general

There are many other traps waiting for you when editing streams like that:

  • Instead of e.g.

    (@customer&) Tj
    

    your stream may contain

    (@cust) Tj
    (omer&) Tj
    

    or

    [(@cust) -6 (omer&) ] TJ 
    

    or even

    (omer&) Tj
    -62 0 Td
    (@cust) Tj
    

    Thus, replacement may suddenly stop working if a new template uses a slightly different representation.

  • Fonts may only be partially embedded. If the glyphs for the characters of your replacements are not included, they will be drawn as gaps.

  • Text drawing operations following the one you edited may count on the former one to have used a specific width. Your replacement can then destroy the former layout.

  • ...

In essence, properly editing streams in generic documents is very difficult.
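The split-placeholder trap from the first bullet can be illustrated without any PDF library. The sketch below treats the string operands of consecutive text-showing operators as a plain list of fragments; the class and method names are invented for this illustration and it deliberately ignores encodings, kerning values and positioning operators.

```java
import java.util.Arrays;
import java.util.List;

class SplitPlaceholderDemo {
    // What the question's loop effectively does: look at each string on its own.
    static boolean foundPerFragment(List<String> fragments, String placeholder) {
        for (String fragment : fragments) {
            if (fragment.contains(placeholder)) {
                return true;
            }
        }
        return false;
    }

    // Slightly smarter: join the fragments in stream order before searching.
    static boolean foundWhenJoined(List<String> fragments, String placeholder) {
        return String.join("", fragments).contains(placeholder);
    }

    public static void main(String[] args) {
        String placeholder = "@customer&";
        List<String> whole = Arrays.asList("@customer&");
        List<String> split = Arrays.asList("@cust", "omer&");
        List<String> reordered = Arrays.asList("omer&", "@cust");

        System.out.println(foundPerFragment(whole, placeholder));     // true
        System.out.println(foundPerFragment(split, placeholder));     // false
        System.out.println(foundWhenJoined(split, placeholder));      // true
        System.out.println(foundWhenJoined(reordered, placeholder));  // false
    }
}
```

Joining fragments helps with the second and third representation, but not with the last one, where the pieces appear in the stream in a different order than on the page. Handling that case would require tracking text positions, which is where the effort starts to resemble full text extraction.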

What else you can do

Instead of content placeholders like your @customer&, you can use AcroForm form fields.

Form fields have names and can be recognized by them. Filling them in does not change anything in the content.

If you don't want people afterwards to edit your PDF form fields, you can mark them as read-only or even flatten them into the content.

mkl
  • Thanks a lot for the well explained answer. – drunkenfist Mar 24 '15 at 19:53
  • It's been a long time since I asked this question, but I have one more problem. As you suggested, I used AcroForms to fill the dynamic texts and made them read-only. The problem is, when I send the pdf as an email, people on iPhone / iPad are not able to see the form contents. Looks like this is a known issue (https://forums.adobe.com/thread/1216563). Any suggestions for this? – drunkenfist Jun 27 '15 at 00:43
  • Would flattening the pdf help? I'm not able to find any good code for flattening a pdf. Could you provide me with any links to flatten a pdf? I had a look at this link http://stackoverflow.com/questions/14454387/pdfbox-how-to-flatten-a-pdf-form but none of them worked properly for me. – drunkenfist Jun 27 '15 at 04:09
  • Flattening forms would help, yes, but I haven't yet done that using pdfbox. Do your form elements have appearance streams? – mkl Jun 27 '15 at 13:10
  • I'm not sure what that is. I have empty text fields with styling (such as font and size) which I replace with plain text. I saw that PDFClown has built-in flattening support. But that reads from a file on the system. I have a byte array and it doesn't look like PDFClown supports reading from bytes directly. – drunkenfist Jun 27 '15 at 23:59
  • *I'm not sure what that is.* - form fields can either provide their own content stream for display or rely on the PDF viewer to construct a content stream from their bare value. Many of the simpler form flattening algorithms count on such content streams being there. *doesn't look like PDFClown supports reading from bytes directly* - it at least supports reading from streams, you can use a `ByteArrayInputStream`. – mkl Jun 28 '15 at 07:12
  • It supports reading from IInputStream (not InputStream). I checked the PDFClown docs, and only FileInputStream implements IInputStream. – drunkenfist Jun 28 '15 at 09:51
  • [This answer](http://stackoverflow.com/a/28826600/1729265) explains how to do it in c#. The same should be possible in java. – mkl Jun 28 '15 at 11:16