
I am reading a PDF and updating its text. It seems to work fine when I replace text with an English word, but when I replace it with an Arabic word, it doesn't work.

In my case, the PdfObject is always of type Indirect, so dict.get(PdfName.CONTENTS).isArray() is false in all cases.

public static void manipulatePdf(String src, String dest) throws IOException, DocumentException {
    PdfReader reader = new PdfReader(src);
    PdfDictionary dict = reader.getPageN(1);
    // Resolve the indirect reference to the page's content
    PdfObject object = dict.getDirectObject(PdfName.CONTENTS);

    // Only handles a single content stream, not an array of streams
    if (object instanceof PRStream) {
        PRStream stream = (PRStream) object;
        byte[] data = PdfReader.getStreamBytes(stream);

        String eredeti = "اختبارات";
        // Round trip through the JVM default charset
        String arabicWord = new String(eredeti.getBytes());

        // Decode with the default charset, replace the placeholder, re-encode as ISO-8859-6
        stream.setData(new String(data).replace("testing", arabicWord).getBytes("ISO-8859-6"));
    }
    PdfStamper stamper = new PdfStamper(reader, new FileOutputStream(dest));
    stamper.close();
    reader.close();
}
Danyal Sandeelo
  • That kind of replacement code works at best in a very limited manner in some documents. In many documents it doesn't work at all. At worst it actually damages the content. – mkl Mar 31 '19 at 10:49
  • Thanks for the prompt response, @mkl. I would like your opinion on how to proceed with text replacement. It's working all fine with English, but I can't replace Arabic words. Any suggestions, please? – Danyal Sandeelo Mar 31 '19 at 10:51
  • Even using a "\u0627\u0644\u0645" string won't show the data. – Danyal Sandeelo Mar 31 '19 at 10:53
  • *"It's working all fine with English"* - that's merely by luck: for the code to work at all, the PDF must be special: all fonts need an encoding that matches your JVM's default encoding, and no kerning may be applied. Furthermore, the **Contents** object must be a stream (not an array of streams, which is also valid PDF). Simple PDFs usually use the standard 14 fonts (fonts every PDF viewer must provide) with **WinAnsiEncoding**. Thus, in such simple PDFs, English (and other western European languages) can usually be replaced using your code. – mkl Mar 31 '19 at 13:33
  • As soon, though, as languages other than western European ones are involved, your code doesn't work at all. The same holds as soon as kerning is applied, as soon as other fonts are used and embedded as subsets, or as soon as **Contents** arrays are used. The only somewhat clean way to replace text in PDF page content is to apply redaction-like code to remove the original text and then draw the replacement over it as new text. – mkl Mar 31 '19 at 13:38
  • @mkl Thanks for the reply. I wrote another implementation that works smoothly. I used `AcroFields` instead of placeholders, so I just grab the form fields, update their values, generate the stream, and save it into another PDF. – Danyal Sandeelo Apr 01 '19 at 05:53
  • which pdf library would you suggest me for Java? I am using iText; how is PDFBox? I can get an idea myself, but an expert opinion matters a lot. – Danyal Sandeelo Apr 01 '19 at 05:54
  • *"which pdf library would you suggest me for Java"* - it depends on your requirements. E.g., comparing iText and PDFBox: if you want to create new PDFs and expect the library to do the layout for you, you'll choose iText; if you want to render PDF pages to bitmaps, you'll choose PDFBox; if you cannot pay for a commercial license but want to keep your code closed, you'll choose PDFBox; if you want a support contract, you'll choose iText; etc. – mkl Apr 02 '19 at 13:13
  • @mkl I had to update the form fields using Java. I did it via iText as well as via PDFBox. I faced some issues assigning Arabic fonts to AcroForm text fields in PDFBox, but fixed them somehow. I would go for PDFBox because it's free. https://stackoverflow.com/questions/55451551/unable-ot-save-arabic-words-in-a-pdf-pdfbox-java – Danyal Sandeelo Apr 02 '19 at 13:16
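
The encoding point raised in the comments can be illustrated without any PDF library. The following is only a minimal sketch using plain JDK charsets: it assumes that `windows-1252` is a close enough stand-in for a font's **WinAnsiEncoding** (the two are similar but not identical), and the class and method names (`EncodingMismatchDemo`, `roundTrips`, `asSeenThroughWinAnsi`) are hypothetical, not part of iText or PDFBox. It shows that the English word fits into that single-byte encoding while the Arabic word does not, and that bytes written as ISO-8859-6 are read back as unrelated Latin characters when the font expects WinAnsi.

```java
import java.nio.charset.Charset;

class EncodingMismatchDemo {

    // Assumption: windows-1252 stands in for PDF WinAnsiEncoding (close, not identical).
    static final Charset WIN_ANSI = Charset.forName("windows-1252");
    // The 8-bit Arabic charset used in the question's getBytes(...) call.
    static final Charset ARABIC_8BIT = Charset.forName("ISO-8859-6");

    // True if every character survives encoding to bytes and decoding back.
    static boolean roundTrips(String text, Charset charset) {
        return text.equals(new String(text.getBytes(charset), charset));
    }

    // Encodes text with one charset, then decodes the bytes as WIN_ANSI,
    // roughly what happens when the font's encoding is WinAnsiEncoding.
    static String asSeenThroughWinAnsi(String text, Charset writtenWith) {
        return new String(text.getBytes(writtenWith), WIN_ANSI);
    }

    public static void main(String[] args) {
        String english = "testing";
        // The Arabic word from the question, written with Unicode escapes
        String arabic = "\u0627\u062e\u062a\u0628\u0627\u0631\u0627\u062a";

        System.out.println("English fits WinAnsi:  " + roundTrips(english, WIN_ANSI));
        System.out.println("Arabic fits WinAnsi:   " + roundTrips(arabic, WIN_ANSI));
        System.out.println("Arabic fits 8859-6:    " + roundTrips(arabic, ARABIC_8BIT));
        System.out.println("8859-6 bytes as WinAnsi: "
                + asSeenThroughWinAnsi(arabic, ARABIC_8BIT));
    }
}
```

In this sketch the Arabic word survives ISO-8859-6 but not the WinAnsi stand-in, and its ISO-8859-6 bytes decode as accented Latin letters. That is consistent with the comments above: writing differently encoded bytes into the content stream does not change which glyphs the page's font maps those bytes to.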
