Replacing a string in a PDF file using iText, but the letter X is not replaced

Question

I'm trying to replace a piece of text in the content of a PDF, but the letter 'X' is not being replaced.

public static void main(String[] args) {

    String DEST = "/home/diego/Documentos/teste.pdf";

    try {
        PdfReader reader = new PdfReader("termoAdesaoCartao.pdf");
        PdfDictionary dictionary = reader.getPageN(1);
        PdfObject object = dictionary.getDirectObject(PdfName.CONTENTS);
        if (object instanceof PRStream) {
            PRStream stream = (PRStream)object;
            byte[] data = PdfReader.getStreamBytes(stream);
            stream.setData(new String(data).replace("Nome Completo", "A-B-C-D-E-F-G-H-I-J-K-L-M-N-O-P-Q-R-S-T-U-V-W-X-Y-Z").getBytes());
        }
        PdfStamper stamper = new PdfStamper(reader, new FileOutputStream(DEST));
        stamper.close();
        reader.close();
    } catch (IOException | DocumentException e) {
        e.printStackTrace();
    }

}

Most likely the font is only partially embedded and the capital letters K, W, X, and Y are not embedded. Editing content like you do works in very fortunate circumstances only. — mkl, Dec 12 '15 at 14:09
Furthermore, as indicated in Viacheslav Vedenin's answer, encoding and decoding strings without explicitly selecting an encoding is a bad idea in general. But assuming the content stream of a PDF page to be UTF-8 encoded is not much better. In general it does not even make sense to assume a single encoding for the whole content stream; one would have to use the encoding of the font selected for the respective content stream part. — mkl, Dec 12 '15 at 14:27
@DiegoMacario You say you have the same problem. Thus, maybe you can provide a sample PDF representative of your use case. Concerning the OP's PDFs we could only guess and look where the described symptoms point. An actual sample would allow for a showcase. — mkl, Dec 18 '15 at 12:51
@mkl My case is close to the author's, but I did one thing differently: I got the charset of the file before replacing the string, but it didn't work. — Diego Macario, Dec 20 '15 at 22:55
@Diego *I got the charset of the file* - I'm not sure I understand you correctly. As explained in my answer, there generally **is no single charset** to decode a content stream with. I asked for the very file to check whether there may be a workaround for it. — mkl, Dec 21 '15 at 05:18
@mkl I understood what you told me; I'm going to implement it as in your example. — Diego Macario, Dec 22 '15 at 13:46
@DiegoMacario If you happen to find show-stoppers, don't hesitate to ask for more details. — mkl, Dec 22 '15 at 14:04

score 7 · Answer 1 · answered Dec 16 '15 at 15:31

In general

Basically, the OP's approach cannot work in general. His code is built upon two major misunderstandings:

He assumes that one can translate a complete content stream from byte[] to String (with all string parameters of text showing operators being legible) using a single character encoding.

This assumption is wrong: Each font may have its own encoding, so if multiple fonts are used on the same page, the same byte value in string operands of different text showing operators may represent completely different characters. Actually the fonts do not even need to contain a mapping to characters, they merely need to map numeric values to glyph painting instructions.

Cf. section 9.4.3 Text-Showing Operators in ISO 32000-1:

A string operand of a text-showing operator shall be interpreted as a sequence of character codes identifying the glyphs to be painted.

With a simple font, each byte of the string shall be treated as a separate character code. The character code shall then be looked up in the font’s encoding to select the glyph, as described in 9.6.6, "Character Encoding".

With a composite font (PDF 1.2), multiple-byte codes may be used to select glyphs. In this instance, one or more consecutive bytes of the string shall be treated as a single character code. The code lengths and the mappings from codes to glyphs are defined in a data structure called a CMap,

Simple PDF generators often merely use standard encodings (which are ASCII'ish and may give rise to assumptions like the OP's one) but there are more and more non-simple PDF generators out there...
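To illustrate this first point, here is a minimal sketch in plain Java (no iText involved; the class name, the method names and the sample bytes are made up for this illustration). It shows that the same byte decodes to different characters depending on the assumed encoding, and that a decode/encode round trip through UTF-8 can silently alter the bytes of a content stream, while ISO-8859-1 at least leaves them intact.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class ContentStreamDecodingDemo {

    /** Decodes the bytes with the given charset and immediately encodes them again. */
    static byte[] roundTrip(byte[] data, Charset charset) {
        return new String(data, charset).getBytes(charset);
    }

    /** Returns the first decoded character as a Unicode code point in hex notation. */
    static String firstCodePoint(byte[] data, Charset charset) {
        return String.format("U+%04X", new String(data, charset).codePointAt(0));
    }

    public static void main(String[] args) {
        // Hypothetical fragment of a content stream: a string operand containing code 0xE9.
        byte[] fragment = {'(', (byte) 0xE9, ')', ' ', 'T', 'j'};
        byte[] code = {(byte) 0xE9};

        // The same byte value yields different characters under different encodings.
        System.out.println("as ISO-8859-1: " + firstCodePoint(code, StandardCharsets.ISO_8859_1));
        System.out.println("as UTF-8:      " + firstCodePoint(code, StandardCharsets.UTF_8));

        // Decoding and re-encoding preserves the bytes with ISO-8859-1 but not with UTF-8.
        System.out.println("ISO-8859-1 round trip intact: "
                + Arrays.equals(fragment, roundTrip(fragment, StandardCharsets.ISO_8859_1)));
        System.out.println("UTF-8 round trip intact:      "
                + Arrays.equals(fragment, roundTrip(fragment, StandardCharsets.UTF_8)));
    }
}
```

Note that even a lossless round trip only keeps the bytes intact; it says nothing about which glyph a given code selects in the currently selected font.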

He assumes he can simply edit the string operands of text-showing operators and the matching glyphs will be shown in the PDF viewer.

This assumption is wrong: Fonts usually only support a fairly limited character set, and a text showing operator uses only a single font, the currently selected one. If one replaces a code in a string argument of such an operator with a different one without a matching glyph in the font, one will at most see a gap!

While complete fonts usually at least contain glyphs for all characters of a kind (e.g. Latin letters with all Western European variations thereof), PDF allows embedding fonts partially, cf. section 9.6.4 Font Subsets in ISO 32000-1:

PDF documents may include subsets of Type 1 and TrueType fonts.

This option is nowadays often used to embed painting instructions only for the glyphs actually used in the existing text. Thus, one cannot count on embedded fonts containing all characters of the same kind if they contain some. There may be a glyph for A and C but not for B.
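As a rough first check one can look at the font name: according to section 9.6.4 of ISO 32000-1, the name of a subset font begins with a tag of exactly six uppercase letters followed by a plus sign. The following sketch in plain Java tests a name for that tag. The class name, the method name and the sample font names are made up for this illustration, and reading the BaseFont entry from the PDF (e.g. with iText) is not shown.

```java
import java.util.regex.Pattern;

class SubsetFontNames {

    // Subset tag as described in ISO 32000-1, section 9.6.4:
    // six uppercase letters, a plus sign, then the actual font name.
    private static final Pattern SUBSET_TAG = Pattern.compile("[A-Z]{6}\\+.+");

    /** Returns true if the given BaseFont name carries a subset tag. */
    static boolean isSubset(String baseFontName) {
        // Tolerate the leading slash of a PDF name object, e.g. "/ABCDEF+Helvetica".
        String name = baseFontName.startsWith("/") ? baseFontName.substring(1) : baseFontName;
        return SUBSET_TAG.matcher(name).matches();
    }

    public static void main(String[] args) {
        // Hypothetical font names, not taken from the OP's file.
        System.out.println(isSubset("/ABCDEF+Helvetica")); // tagged as a subset
        System.out.println(isSubset("/Helvetica"));        // no subset tag
    }
}
```

A subset tag is only a hint that glyphs may be missing. The absence of a tag does not guarantee that the font contains every glyph one might want to use.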

In the case at hand

Unfortunately the OP has not supplied his sample PDF. The symptoms, though:

his call replace("Nome Completo", "A-B-C-D-E-F-G-H-I-J-K-L-M-N-O-P-Q-R-S-T-U-V-W-X-Y-Z") makes a difference as can be seen in his screenshot

and his comment to Viacheslav Vedenin's answer

Before the text was (Nome Completo)Tj and after (A-B-C-D-E-F-G-H-I-J-K-L-M-N-O-P-Q-R-S-T-U-V-W-X-Y-Z)Tj

but some codes do not show as the expected glyphs, as can also be seen in the screenshot above

point in the direction that the latter of his two major false assumptions described above makes the OP's code fail him: Most likely the font in question uses a standard encoding (probably WinAnsiEncoding) but is only partially embedded, in particular without the capital letters K, W, X, and Y.

How to do it correctly

Instead of blindly editing the content stream, the OP (who already is using iText) can use the following iText concepts:

text extraction classes can also be used to extract the coordinates of text (cf. multiple answers on Stack Overflow), in particular the bounding rectangle of the text he wants to replace;
the iText xtra library class PdfCleanUpProcessor can be used to remove all content existing in that bounding rectangle;
PdfStamper.getOverContent() can then be used to properly add new content at those coordinates.

This may sound complicated but this takes care of a number of additional minor misconceptions visible in the OP's approach.

Slava Vedenin · Answer 2 · 2015-12-12T12:10:01.340

Instead of

stream.setData(new String(data).replace("Nome Completo", "A-B-C-D-E-F-G-H-I-J-K-L-M-N-O-P-Q-R-S-T-U-V-W-X-Y-Z").getBytes());

try the following code:

stream.setData(new String(data, "UTF8").replace("Nome Completo", "A-B-C-D-E-F-G-H-I-J-K-L-M-N-O-P-Q-R-S-T-U-V-W-X-Y-Z").getBytes("UTF8"));

According to this post in the Oracle manual, using new String(data) and getBytes() without an explicit encoding can lead to errors:

Byte Encodings and Strings

If a byte array contains non-Unicode text, you can convert the text to Unicode with one of the String constructor methods. Conversely, you can convert a String object into a byte array of non-Unicode characters with the String.getBytes method. When invoking either of these methods, you specify the encoding identifier as one of the parameters.

The example that follows converts characters between UTF-8 and Unicode. UTF-8 is a transmission format for Unicode that is safe for UNIX file systems. The full source code for the example is in the file StringConverter.java.

Update: If that doesn't work, can you replace the code

byte[] data = PdfReader.getStreamBytes(stream);
stream.setData(new String(data).replace("Nome Completo", "A-B-C-D-E-F-G-H-I-J-K-L-M-N-O-P-Q-R-S-T-U-V-W-X-Y-Z").getBytes());

with the code

byte[] data = PdfReader.getStreamBytes(stream);
String str = new String(data);
System.out.println(str);
String newStr = str.replace("Nome Completo", "A-B-C-D-E-F-G-H-I-J-K-L-M-N-O-P-Q-R-S-T-U-V-W-X-Y-Z");
System.out.println(newStr);
stream.setData(newStr.getBytes());

and post what is shown in the console?

Before the text was `(Nome Completo)Tj` and after `(A-B-C-D-E-F-G-H-I-J-K-L-M-N-O-P-Q-R-S-T-U-V-W-X-Y-Z)Tj`. I didn't post the whole output because it generated too much text. — user3503888, Dec 12 '15 at 12:29
While encoding and decoding strings without explicitly selecting an encoding indeed is a bad idea, assuming the content stream of a PDF page to be utf-8 encoded is very creative. — mkl, Dec 12 '15 at 14:21
I got my `charset` with this `InputStreamReader inputStreamReader = new InputStreamReader(new FileInputStream(new File("termoAdesaoCartao.pdf"))); String charset = inputStreamReader.getEncoding();` and it returned `ISO8859_1`. Now the console shows `(A-B-C-D-E-F-G-H-I-J-K-L-M-N-O-P-Q-R-S-T-U-V-W-X-Y-Z)Tj` but the letter is still not saved. — user3503888, Dec 12 '15 at 16:25

score 0 · Answer 3 · answered Oct 11 '18 at 20:19

I modified the code I found a bit, and it works as follows:

public static final String SRC = "C:/tmp/244558.pdf";
public static final String DEST = "C:/tmp/244558-2.pdf";

public static void main(String[] args) throws IOException, DocumentException {
    File file = new File(DEST);
    file.getParentFile().mkdirs();
    new Main().manipulatePdf(SRC, DEST);
}

public void manipulatePdf(String src, String dest) throws IOException, DocumentException {
    PdfReader reader = new PdfReader(src);
    PdfDictionary dict = reader.getPageN(1);
    // The page content may be a single stream or an array of streams; normalize it to an array.
    PdfArray refs = null;
    if (dict.get(PdfName.CONTENTS).isArray()) {
        refs = dict.getAsArray(PdfName.CONTENTS);
    } else if (dict.get(PdfName.CONTENTS).isIndirect()) {
        refs = new PdfArray(dict.get(PdfName.CONTENTS));
    }
    for (int i = 0; i < refs.getArrayList().size(); i++) {
        PRStream stream = (PRStream) refs.getDirectObject(i);
        byte[] data = PdfReader.getStreamBytes(stream);
        stream.setData(new String(data).replace("Data replace", "Data").getBytes());
    }
    PdfStamper stamper = new PdfStamper(reader, new FileOutputStream(dest));
    stamper.close();
    reader.close();
}
