How to extract PRStream from a pdf having built-in encoding using itext?

Question

I need to replace text in an existing PDF and create a new one. For that I am using the iText library in Java. Until now I only had PDFs with ANSI encoding, so I would run the following lines:

            PdfReader reader = new PdfReader(SOURCE_PDF);
            PdfDictionary page = reader.getPageN(1);
            byte[] pageContentInput = ContentByteUtils.getContentBytesForPage(reader, 1);
            String dd = new String(pageContentInput, BaseFont.CP1252);

Decoding with BaseFont.CP1252 gave me the text in the string "dd". For a PDF with a built-in encoding, however, CP1252 yields unreadable string operands in front of Tj, whereas in the ANSI case the operand of Tj is readable text.

I need not only the text from the page but also the whole formatting, i.e. the operators such as Tj, Tf etc., so that I can create a new PDF with the same formatting. That's why I am using getContentBytesForPage.

How can I get the PDF text stream out of a PDF that uses a built-in encoding?

Please read [this answer](https://stackoverflow.com/a/60655298/1729265) to understand why in general your task is very difficult to implement if at all. So unless you want to process only very specific pdfs subject to a number of restrictions, you should find a different approach to your use case, e.g. use of acroform fields. — mkl, Mar 17 '20 at 05:42
@mkl I know it's not easy and there are so many variables. But if I were to create a generic solution, how do you suppose I use the Unicode mapping to extract text from a built-in encoding? I do have a Unicode mapping attached to the PDF, I just don't know whether it will be useful. If it is, I don't know how to use it in the code. — pikapi, Mar 18 '20 at 05:20
In particular you don't apply a single encoding for the whole byte array; each string object therein can be encoded differently. You have to parse the byte array instruction by instruction, keep track of which font is currently selected, and when you encounter a text drawing instruction, its string arguments have to be decoded according to the properties of that current font. The properties to use may be its **Encoding**, its **ToUnicode** map, information from the underlying font file,... depending on which font type it is and which optional information is given. — mkl, Mar 18 '20 at 05:46
@mkl can you give any simple example of how to get information from the properties of current font? — pikapi, Mar 19 '20 at 07:15
I wrote an answer pointing to a number of older answers providing a low-level stream editing class and usage examples, among others deleting text matching a search text or some text style properties. They retrieve the properties of the current font under the hood. — mkl, Mar 20 '20 at 09:52

score 0 · Answer 1 · answered Mar 19 '20 at 11:14

As already mentioned in comments, you don't use a single encoding to decode the whole byte array because each string object therein can be encoded differently.

You have to parse the byte array instruction by instruction, keep track of which font is currently selected, and when you encounter a text drawing instruction, its string arguments have to be decoded according to the properties of that current font.

The properties to use may be its Encoding, its ToUnicode map, information from the underlying font file,... depending on which font type it is and which optional information is given.
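
To make the parsing steps above concrete, here is a minimal, self-contained sketch in plain Java. It is a toy and not iText API: the class `PerFontDecodingSketch`, its methods, and the per-font code-to-text maps (standing in for an Encoding or ToUnicode map) are all hypothetical. It assumes whitespace-separated tokens, literal strings without spaces, escapes, or nesting, and one byte per character code; a real content stream needs a proper tokenizer and the actual font dictionaries.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy illustration only: NOT iText API and not a complete PDF parser.
class PerFontDecodingSketch {

    /** Decodes each Tj string with the map of the font selected by the latest Tf. */
    static List<String> decode(byte[] content, Map<String, Map<Integer, String>> fonts) {
        // ISO-8859-1 maps each byte to the char with the same value, so no code is lost.
        String[] tokens = new String(content, StandardCharsets.ISO_8859_1).trim().split("\\s+");
        List<String> operands = new ArrayList<>();
        List<String> texts = new ArrayList<>();
        String currentFont = null;
        for (String token : tokens) {
            if (token.equals("Tf") && !operands.isEmpty()) {
                // Operands of Tf are "/FontName size"; remember the name without the slash.
                currentFont = operands.get(0).substring(1);
                operands.clear();
            } else if (token.equals("Tj") && !operands.isEmpty()) {
                String raw = operands.get(0);                      // e.g. "(ABC)"
                String codes = raw.substring(1, raw.length() - 1); // strip the parentheses
                Map<Integer, String> map = fonts.getOrDefault(currentFont, new HashMap<>());
                texts.add(decodeString(codes, map));
                operands.clear();
            } else if (isOperand(token)) {
                operands.add(token);
            } else {
                operands.clear(); // any other operator (BT, ET, ...) consumes its operands
            }
        }
        return texts;
    }

    /** Maps every character code through the font-specific table. */
    static String decodeString(String codes, Map<Integer, String> toUnicode) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < codes.length(); i++) {
            int code = codes.charAt(i);
            sb.append(toUnicode.getOrDefault(code, "\uFFFD")); // unknown code
        }
        return sb.toString();
    }

    /** Names, literal strings and numbers are operands; everything else is an operator. */
    static boolean isOperand(String token) {
        return token.startsWith("/") || token.startsWith("(")
                || token.matches("[-+]?\\d*\\.?\\d+");
    }

    public static void main(String[] args) {
        Map<Integer, String> f1 = new HashMap<>();
        f1.put(65, "A"); f1.put(66, "B"); f1.put(67, "C");
        Map<Integer, String> f2 = new HashMap<>(); // a "built-in" encoding: same codes, other glyphs
        f2.put(65, "x"); f2.put(66, "y"); f2.put(67, "z");
        Map<String, Map<Integer, String>> fonts = new HashMap<>();
        fonts.put("F1", f1);
        fonts.put("F2", f2);
        byte[] content = "BT /F1 12 Tf (ABC) Tj /F2 12 Tf (ABC) Tj ET"
                .getBytes(StandardCharsets.ISO_8859_1);
        System.out.println(decode(content, fonts)); // prints [ABC, xyz]
    }
}
```

The two `(ABC)` strings consist of identical bytes but decode to different text because a different font is current each time, which is why a single `new String(bytes, CP1252)` over the whole page content cannot work in general.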

But even after doing so, you cannot simply replace the text in the original PDF. This answer (to a similar question in the context of the PDFBox library) illustrates a number of hindrances, in particular fonts (which may be subset-embedded only) not containing the glyphs you need and unclear layout considerations.

To get an idea of how to address the former issues, i.e. parsing the content stream and tracking the current font, have a look at the following answers:

This answer which provides PdfContentStreamEditor classes for Java and C# which can serve as base classes to edit content stream instructions; these classes in particular also keep track of the graphics state including the current text state parameters.
This answer (the OP unfortunately deleted the question, so you need some reputation to have permission to read the answer) uses that PdfContentStreamEditor Java class to implement a text remover for text in a specific font and another one for text with a large font size.
This answer uses that PdfContentStreamEditor C# class to implement a BigTextRemover which recognizes text by its font size and removes it.
This answer describes what to do to prevent PdfContentStreamEditor issues with rotated documents.
This answer also describes what to do to prevent PdfContentStreamEditor issues with rotated documents and additionally fixes a bug in the PdfContentStreamEditor.
This answer uses that PdfContentStreamEditor Java class to implement an editor that changes the color of black text to green.
This answer provides a port of the PdfContentStreamEditor to iText 7 / Java as PdfCanvasEditor and shows example usages removing text by font name or font size and re-coloring black text to green.
This answer uses that PdfContentStreamEditor C# class to implement a TextRemover removing all text drawing instructions.
This answer uses that PdfContentStreamEditor Java class to implement a SimpleTextRemover which recognizes a search text in text drawing instructions, removes it, and returns the positions at which the text was removed (under some restrictions explained there). At those positions one then can draw new text.

Studying the PdfContentStreamEditor from the first answer (with the fix from the fifth answer) and the SimpleTextRemover should give you an idea of how to find text. The other answers might be interesting in general if you want to edit PDFs in different ways.

As far as replacing goes, keep in mind that fonts may be incomplete. In general you therefore cannot simply replace the contents of the string arguments of text drawing instructions; instead you may have to add a new font and switch fonts for the replacement text drawing instruction.
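
The decision described above can be sketched in plain Java as follows. Again this is a toy under stated assumptions and not iText API: `ReplacementFontSketch` and its methods are hypothetical, the current font is described only by a code-to-text map, the fallback font is assumed to be already registered in the page resources under the given name, and it is assumed to use an encoding in which the replacement characters can be written directly as their own codes.

```java
import java.util.HashMap;
import java.util.Map;

// Toy illustration only: plain Java, NOT iText API.
class ReplacementFontSketch {

    /** Inverts a code-to-text map; only single-character entries are usable for encoding. */
    static Map<Character, Integer> invert(Map<Integer, String> toUnicode) {
        Map<Character, Integer> result = new HashMap<>();
        for (Map.Entry<Integer, String> e : toUnicode.entrySet()) {
            if (e.getValue().length() == 1) {
                result.putIfAbsent(e.getValue().charAt(0), e.getKey());
            }
        }
        return result;
    }

    /**
     * Returns content stream instructions drawing newText: in the current font if that
     * font covers every character, otherwise in the fallback font, followed by a switch
     * back to the current font.
     */
    static String replacementInstructions(String newText, String currentFont,
            Map<Integer, String> currentToUnicode, String fallbackFont, int fontSize) {
        Map<Character, Integer> encoder = invert(currentToUnicode);
        StringBuilder codes = new StringBuilder();
        boolean covered = true;
        for (char c : newText.toCharArray()) {
            Integer code = encoder.get(c);
            if (code == null) { // glyph missing, e.g. because the font is subset-embedded
                covered = false;
                break;
            }
            codes.append((char) code.intValue());
        }
        if (covered) {
            return "(" + escape(codes.toString()) + ") Tj";
        }
        return "/" + fallbackFont + " " + fontSize + " Tf (" + escape(newText) + ") Tj /"
                + currentFont + " " + fontSize + " Tf";
    }

    /** Escapes backslashes and parentheses inside a PDF literal string. */
    static String escape(String s) {
        return s.replace("\\", "\\\\").replace("(", "\\(").replace(")", "\\)");
    }
}
```

With a current font that maps code 65 to "x" and 66 to "y", the text "xy" can be written as `(AB) Tj` in that font, while "xz" needs the fallback font. Layout (widths, kerning, positioning of following text) is ignored entirely here and has to be handled separately.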
