How to replace text in a PDF with correct encoding using iText

Question

I wrote a Java program for translating PDFs, using the Google API for translation. The translation looks correct in my Eclipse IDE console, but in the newly created PDF the text is either copied untranslated, only partly translated, or the new PDF comes out empty and sometimes corrupted.

I suppose it has something to do with encoding and font types.

I have already gone through the iText page and all the related questions, but none worked for my case. I am trying to translate Portuguese, Spanish, Finnish, French, Hungarian, etc. into English.

Here is my code:

public static final String SRC = "5587309Finnish.pdf";
public static final String DEST = "changed.pdf";

public static void main(String[] args) throws java.io.IOException, DocumentException {
    Translate translate = TranslateOptions.getDefaultInstance().getService();
    PdfReader reader = new PdfReader(SRC);
    int pages = reader.getNumberOfPages();
    PdfStamper stamper = new PdfStamper(reader, new FileOutputStream(DEST));
    for (int i = 1; i <= pages; i++) {
        PdfDictionary dict = reader.getPageN(i);
        PdfObject object = dict.getDirectObject(PdfName.CONTENTS);
        if (object instanceof PRStream) {
            String pageContent = PdfTextExtractor.getTextFromPage(reader, i);
            String[] word = pageContent.split(" ");

            PRStream stream = (PRStream) object;
            byte[] data = PdfReader.getStreamBytes(stream);
            String dd = new String(data, BaseFont.CP1252);

            for (int j = 0; j < word.length; j++) {
                Translation translation = translate.translate(word[j],
                        Translate.TranslateOption.sourceLanguage("fi"),
                        Translate.TranslateOption.targetLanguage("en"));
                // Here I can check that the translation is correct.
                System.out.println(word[j] + "-->>" + translation.getTranslatedText());
                dd = dd.replace(word[j], translation.getTranslatedText());
            }

            stream.setData(dd.getBytes());
        }
    }
    stamper.close();
    reader.close();
}

Please help.

Using System.out is a bad idea, as it uses the platform encoding over which you have no control. Unless you are using Linux with UTF-8. Write to file in UTF-8 and you will be able to check everything using NotePad++ or such. — Joop Eggen, Oct 25 '19 at 13:34
@JoopEggen I'm using System.out just for printing to the IDE console, and that output is correct. I'm trying to write the translated text of the source pdf to a destination pdf. To write it I'm using this line: dd = dd.replace(word[j],translation.getTranslatedText()); I also tried writing to the new pdf using UTF-8, but the pdf was empty. — pikapi, Oct 26 '19 at 14:19
Still no answers... So you have to debug more intensively. Split the task: extracting the text; translating; creating a PDF with PdfStamper. That way you can at least narrow the problem down and already have a working solution for the other parts. Then ask a more directed question. Look up samples of code. Good luck. — Joop Eggen, Oct 28 '19 at 08:03
Working immediately in a PDF content stream is very tricky. Encodings of texts may differ from font to font and may be pretty arbitrary. Furthermore, the fonts may be embedded only as a subset, making a replacement with your desired text impossible using the same PDF font. Furthermore, words may be drawn only piece-wise to allow for kerning steps, so even ignoring the encoding and subsetting your words might be hard to find. For an idea of the hindrances to content stream editing and an approach idea to overcome them, see [this answer](https://stackoverflow.com/a/58593586/1729265). — mkl, Oct 28 '19 at 15:27
@mkl For the document that I'm using in this case, using Adobe Acrobat, when I check Document Properties >>Fonts, I have ::: (1) Helvetica, Type: Type1, Encoding: Ansi, Actual Font: ArialMT, Actual Font Type : TrueType (2) Helvetica-Bold, Type: Type1, Encoding: Ansi, Actual Font: Arial-BoldMT, Actual Font Type : TrueType (3) Helvetica-Oblique, Type: Type1, Encoding: Ansi, Actual Font: Arial-ItalicMT, Actual Font Type : TrueType. So I know this kind of information about my pdf files by checking manually. So even now is it impossible? Can I not replace whole pdf using this information? — pikapi, Oct 29 '19 at 06:40
@mkl Also I'm not trying to replace in the same pdf file, I'm extracting text from one, translate it and then trying to write on a new pdf file.Is this approach not correct? — pikapi, Oct 29 '19 at 06:41
*"For the document that I'm using in this case..."* - your description sounds like only standard 14 fonts are used and they are used with WinAnsiEncoding. This surely would make things easier. But the text words still are not necessarily in a single piece (for kerning purposes); as you apparently only have to handle a single document, check this by outputting your 'dd' strings and inspecting them. — mkl, Oct 29 '19 at 11:25
*"Also I'm not trying to replace in the same pdf file, I'm extracting text from one, translate it and then trying to write on a new pdf file."* - Well, your code reads a content stream, manipulates it, and writes it back. That is not creation of a new PDF but manipulation of an existing one, even if you store the result with a different name. By the way, your code ignores one thing completely: The content stream does not only contain the text drawn on the page but also instructions that control how the text shall be drawn. So your code also applies translation to instructions... destructively. — mkl, Oct 29 '19 at 11:40
@mkl I tried editing the content stream directly by replacing the text before Tj. This works almost okay only problem is few of the text is bold and my code is not able to replace the bold text completely. I haven't set any encoding I just want it to be in the same encoding as it is. — pikapi, Oct 31 '19 at 10:28
@mkl Also I don't know how to get the content stream for pdf files having IDENTITY_H as encoding or some other encoding ?My code gives following error when I use String dd = new String(data, BaseFont.IDENTITY_H); Exception in thread "main" java.io.UnsupportedEncodingException: Times-Roman at java.base/java.lang.StringCoding.decode(StringCoding.java:243) at java.base/java.lang.String.<init>(String.java:467) at java.base/java.lang.String.<init>(String.java:537) at Td_Tj.main(Td_Tj.java:51) Do I need some extra files to be downloaded? — pikapi, Oct 31 '19 at 10:31
*"This works almost okay only problem is few of the text is bold and my code is not able to replace the bold text completely"* - I don't understand what you mean here. *"I haven't set any encoding I just want it to be in the same encoding as it is"* - Above you listed for the document you are using here only one encoding, Ansi, i.e. WinAnsiEncoding, both for Helvetica and Helvetica-Bold. So, the same encoding. — mkl, Oct 31 '19 at 12:44
*"Also I don't know how to get the content stream for pdf files having IDENTITY_H as encoding or some other encoding ?"* - You get the content stream just as before as `PRStream stream`. What you appear to mean, though, is how to transform it into a single editable string. Here the answer is: *You don't.* As soon as you have different and in particular non-ASCII-ish encodings, *your current approach does not make any sense any more.* You will have to parse the instructions in the content stream bytes to always know the current font and process string arguments using the encoding of that font. — mkl, Oct 31 '19 at 12:54
@mkl The thing is string like > Förfallodatum is read like this by eclipse >> [( )250(F\366rfallodatum)]TJ. So how do I replace this string with translated text? — pikapi, Nov 04 '19 at 08:45
Use the `PdfContentStreamEditor` from [this question](https://stackoverflow.com/a/35915789/1729265) and replace text in string arguments of text showing operations. — mkl, Nov 04 '19 at 09:46
@mkl I don't understand how I'll be able to edit using PdfContentStreamEditor . I have my updated code here. https://drive.google.com/file/d/1v76lKoTavu_lM0WKMsRH6dp-ypldLBR0/view?usp=sharing I am getting the update dd(i.e. content stream which I am printing) correctly with the replaced text. I don't know why I am getting a blank pdf — pikapi, Nov 06 '19 at 09:17
@mkl What I want is to translate one pdf and get another pdf, provided the indentation & formatting remain same. Is it really not possible to do this using itext & google-api? — pikapi, Nov 07 '19 at 11:27
*"I have my updated code here..."* - it still looks like you attempt to get around the need of properly parsing the content stream. If the `PdfContentStreamEditor` usage is not clear, you may want to use a `PRTokeniser` and `PdfContentParser` for parsing. That way you have less information at your hand but there also are less complications in the architecture. If you can share a representative example PDF for your use case, I'll check whether those approaches make sense, and if they do, I'll show the basic usage of those classes for changing contents. — mkl, Nov 13 '19 at 15:20

score 0 · Answer 1 · answered Nov 13 '19 at 18:14

According to a comment you have improved your code and are

*"getting the update dd(i.e. content stream which I am printing) correctly with the replaced text. I don't know why I am getting a blank pdf"*

Thus, I assume that your (hopefully representative) test PDFs have all their fonts of interest encoded in ANSI'ish encodings and the text arguments of the text drawing instructions contain whole words or even phrases which can properly be processed because otherwise text replacement would not have been possible.

Thus, here is an example of how one can replace text pieces with similarly long ones under such benign circumstances without breaking the content stream syntax. In this example I simply use a Map containing replacement strings. You can do your translation there.

First a frame loading the source, creating a stamper, iterating over the pages, and calling a helper to create a content stream replacement:

// Replacement table; a translation routine could be plugged in here instead.
Map<String, String> replacements = new HashMap<>();
replacements.put("Förfallodatum", "Ablaufdatum");

// SOURCE_INPUTSTREAM and RESULT_FILE are placeholders for your own input and output.
try (   InputStream resource = SOURCE_INPUTSTREAM;
        OutputStream result = new FileOutputStream(RESULT_FILE)  ) {
    PdfReader pdfReader = new PdfReader(resource);
    PdfStamper pdfStamper = new PdfStamper(pdfReader, result);
    for (int pageNum = 1; pageNum <= pdfReader.getNumberOfPages(); pageNum++) {
        PdfDictionary page = pdfReader.getPageN(pageNum);
        // Read the complete page content ...
        byte[] pageContentInput = ContentByteUtils.getContentBytesForPage(pdfReader, pageNum);
        // ... drop the original content from the page ...
        page.remove(PdfName.CONTENTS);
        // ... and write the edited instructions into a fresh content stream.
        replaceInStringArguments(pageContentInput, pdfStamper.getUnderContent(pageNum), replacements);
    }
    pdfStamper.close();
}

(EditPageContentSimple test testReplaceInStringArgumentsForklaringAvFakturan)

The method replaceInStringArguments now parses the instructions in the given content stream, isolates string arguments, and calls another helper for each string argument doing the replacement.

void replaceInStringArguments(byte[] contentBytesBefore, PdfContentByte canvas, Map<String, String> replacements) throws IOException {
    PRTokeniser tokeniser = new PRTokeniser(new RandomAccessFileOrArray(new RandomAccessSourceFactory().createSource(contentBytesBefore)));
    PdfContentParser ps = new PdfContentParser(tokeniser);
    ArrayList<PdfObject> operands = new ArrayList<PdfObject>();
    // Each parse() call fills the list with one instruction: its operands followed by the operator.
    while (ps.parse(operands).size() > 0){
        for (int i = 0; i < operands.size(); i++) {
            PdfObject pdfObject = operands.get(i);
            if (pdfObject instanceof PdfString) {
                // String given directly as an operand
                operands.set(i, replaceInString((PdfString)pdfObject, replacements));
            } else if (pdfObject instanceof PdfArray) {
                // Strings inside an array operand
                PdfArray pdfArray = (PdfArray) pdfObject;
                for (int j = 0; j < pdfArray.size(); j++) {
                    PdfObject arrayObject = pdfArray.getPdfObject(j);
                    if (arrayObject instanceof PdfString) {
                        pdfArray.set(j, replaceInString((PdfString)arrayObject, replacements));
                    }
                }
            }
        }
        // Write the (possibly changed) instruction back, one instruction per line.
        for (PdfObject object : operands)
        {
            object.toPdf(canvas.getPdfWriter(), canvas.getInternalBuffer());
            canvas.getInternalBuffer().append((byte) ' ');
        }
        canvas.getInternalBuffer().append((byte) '\n');
    }
}

(EditPageContentSimple helper method)
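To illustrate what such a string argument looks like before it is parsed: the comments above quote the instruction `[( )250(F\366rfallodatum)]TJ`, where `\366` is an octal escape for a single byte. The following is a minimal sketch in plain Java (no iText) of how such a literal string body could be decoded. The class and method names are hypothetical, the input is assumed to contain only single-byte characters, and the font is assumed to use WinAnsiEncoding, approximated here by Java's `windows-1252` charset. In the answer's code the iText parser already does this work.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.Charset;

class LiteralStringDemo {

    // Hypothetical helper: decodes the body of a PDF literal string (the part
    // between the parentheses) into raw bytes, resolving backslash escapes.
    // Assumes every char of the input stands for one byte (value below 256).
    static byte[] decodeLiteral(String body) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        int i = 0;
        while (i < body.length()) {
            char c = body.charAt(i++);
            if (c != '\\' || i == body.length()) {
                out.write(c);               // ordinary character (or trailing backslash)
                continue;
            }
            char e = body.charAt(i++);
            if (e >= '0' && e <= '7') {
                // Octal escape: one to three octal digits form one byte.
                int code = e - '0';
                int digits = 1;
                while (digits < 3 && i < body.length()
                        && body.charAt(i) >= '0' && body.charAt(i) <= '7') {
                    code = code * 8 + (body.charAt(i++) - '0');
                    digits++;
                }
                out.write(code & 0xFF);
            } else {
                switch (e) {
                    case 'n': out.write('\n'); break;
                    case 'r': out.write('\r'); break;
                    case 't': out.write('\t'); break;
                    case 'b': out.write('\b'); break;
                    case 'f': out.write('\f'); break;
                    case '\n': break;       // backslash + line break: line continuation
                    case '\r':              // same, also swallowing a following LF
                        if (i < body.length() && body.charAt(i) == '\n') i++;
                        break;
                    default: out.write(e);  // covers \( \) and \\
                }
            }
        }
        return out.toByteArray();
    }

    // Interprets the decoded bytes with windows-1252 as a stand-in for WinAnsiEncoding.
    static String decodeWinAnsi(String body) {
        return new String(decodeLiteral(body), Charset.forName("windows-1252"));
    }

    public static void main(String[] args) {
        // Byte 0366 (octal) is 0xF6, which windows-1252 maps to "ö".
        System.out.println(decodeWinAnsi("F\\366rfallodatum"));
    }
}
```

For fonts with other encodings (for example Identity-H) the bytes cannot be interpreted this way, which is why the approach in this answer is limited to benign cases.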

The method replaceInString in turn retrieves a single string operand (a PdfString instance), manipulates it, and returns the manipulated string version:

PdfString replaceInString(PdfString string, Map<String, String> replacements) {
    // Decode the string bytes; this is only meaningful for ANSI'ish font encodings.
    String value = PdfEncodings.convertToString(string.getBytes(), PdfObject.TEXT_PDFDOCENCODING);
    for (Map.Entry<String, String> entry : replacements.entrySet()) {
        value = value.replace(entry.getKey(), entry.getValue());
    }
    // Encode the changed text again using the same encoding.
    return new PdfString(PdfEncodings.convertToBytes(value, PdfObject.TEXT_PDFDOCENCODING));
}

(EditPageContentSimple helper method)

Instead of that for loop here you would call your translation routine and translate value.
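A minimal sketch of what such a translation hook could look like, written in plain Java so it can be tried without a translation service. Everything here is an assumption for illustration: the class `TranslateHookDemo`, the method `translateValue`, and the idea of passing the translation in as a `UnaryOperator<String>`. In real code the operator would wrap a call like `translate.translate(...).getTranslatedText()` from the question.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.UnaryOperator;

class TranslateHookDemo {

    // Hypothetical translation hook; a real one would call the translation service.
    private final UnaryOperator<String> translator;

    // Cache so that a repeated phrase is only sent to the translator once.
    private final Map<String, String> cache = new HashMap<>();

    TranslateHookDemo(UnaryOperator<String> translator) {
        this.translator = translator;
    }

    // Translates the text of one string operand as a whole, keeping any
    // leading and trailing whitespace of the operand unchanged.
    String translateValue(String value) {
        String core = value.trim();
        if (core.isEmpty()) {
            return value;   // pure spacing such as "( )" needs no translation
        }
        String translated = cache.computeIfAbsent(core, translator);
        int start = value.indexOf(core);
        return value.substring(0, start) + translated
                + value.substring(start + core.length());
    }

    public static void main(String[] args) {
        // Stand-in dictionary instead of a real translation call.
        Map<String, String> dictionary = new HashMap<>();
        dictionary.put("F\u00f6rfallodatum", "Due date");

        TranslateHookDemo demo =
                new TranslateHookDemo(s -> dictionary.getOrDefault(s, s));
        System.out.println("[" + demo.translateValue(" F\u00f6rfallodatum ") + "]");
    }
}
```

Inside `replaceInString`, the replacement loop would then become something like `value = demo.translateValue(value);`. Translating the whole operand rather than single words is a design choice of this sketch; it still cannot help when one word is split across several string operands.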

As has been mentioned before, this code only works under certain benign circumstances. Don't expect it to work for arbitrary documents from the wild, in particular not for documents with other than Western European glyphs.
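One of these limits can be checked up front: whether a translated string can be written in a single-byte Western encoding at all. The sketch below is plain Java and uses the `windows-1252` charset as a stand-in for WinAnsiEncoding; that substitution is an assumption, as are the class and method names. It says nothing about whether a subset-embedded font actually contains the required glyphs.

```java
import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;

class WinAnsiCheckDemo {

    // Hypothetical pre-check: true if every character of the text has a
    // code in windows-1252 (used here as an approximation of WinAnsiEncoding).
    static boolean fitsWinAnsi(String text) {
        CharsetEncoder encoder = Charset.forName("windows-1252").newEncoder();
        return encoder.canEncode(text);
    }

    public static void main(String[] args) {
        // "ö" (U+00F6) is part of windows-1252.
        System.out.println(fitsWinAnsi("F\u00f6rfallodatum"));   // true
        // Hungarian "ő" (U+0151) is not part of windows-1252.
        System.out.println(fitsWinAnsi("f\u0151"));              // false
    }
}
```

A string that fails this check would need a different font and encoding in the output, which is outside what the simple replacement code handles.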

I tried your code and it's working for PDFs with ANSI encoding, but because words are not kept together in one <....>Tj, they weren't translated properly (e.g. the Swedish PDF in the link). I have attached a link to a few PDFs and the final code: https://drive.google.com/drive/folders/1yoSbizqLAVjO8LfQP6QPoePYiP2FHeI9?usp=sharing Also, I know you said you assumed all the fonts of interest would be encoded in ANSI'ish encodings, but that's not the case for a few PDFs. What changes will I have to make for PDFs with IDENTITY-H encoding (e.g. the Portuguese PDF) to work? As of now I get ? in place of text in the converted PDF. — pikapi, Nov 21 '19 at 06:35
@pikapi The code given in the answer explicitly requires *benign circumstances*, and it does so because a generic solution will be far beyond the scope of a stack overflow question. Actually you already see some limitations of the simple approach taken here in your example PDFs, e.g. artifacts due to missing glyphs in subset embedded fonts. When I find some time, I'll have a look at the other example PDFs and try to find an easy solution for such files. A truly generic solution, though, remains beyond what you can hope for here. — mkl, Nov 21 '19 at 13:58
For the missing glyphs would it be easier to just provide a font for the new pdf and write it in that particular font. Keeping the size same as earlier pdf only font family would change. This way I can be safe from missing glyphs and don't have to worry about the pdf fonts being subset, right? But as I am using stream.setData() where do I set the font for the new pdf. — pikapi, Nov 22 '19 at 07:56
*"For the missing glyphs would it be easier to just provide a font for the new pdf"* - Yes, you need a new font. In the code above only text showing operations are changed (implicitly, as text showing operations are the only operations with strings as immediate parameters or entries in an immediate array parameter). You can also check whether an operation is a font setting operation (**Tf** operator) and replace the first parameter accordingly. — mkl, Nov 22 '19 at 10:01
