Getting wrong Arabic text in PDF with iText

Question

I am generating a PDF file from an HTML string, but the content of the generated PDF does not match the HTML: the PDF contains what looks like random content. I read about the issue online, and the suggestion was to use Unicode notation like %u0627%u0646%u0627%20%u0627%u0633%u0645%u0649%20%u0639%u0628%u062F%u0627%u0644%u0644%u0647. But when I put this into my HTML, it is printed as-is.

related issue: Writing Arabic in pdf using itext

package com.example.demo;

import com.itextpdf.html2pdf.ConverterProperties;
import com.itextpdf.html2pdf.HtmlConverter;
import com.itextpdf.styledxmlparser.css.media.MediaDeviceDescription;
import com.itextpdf.styledxmlparser.css.media.MediaType;
import com.itextpdf.html2pdf.resolver.font.DefaultFontProvider;
import com.itextpdf.layout.font.FontProvider;
import org.springframework.boot.SpringApplication;
import org.springframework.boot.autoconfigure.SpringBootApplication;

import java.io.ByteArrayOutputStream;
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;

@SpringBootApplication
public class DemoApplication {

    public static void main(String[] args) throws IOException {
        SpringApplication.run(DemoApplication.class, args);
        String htmlSource = getContent();
        ByteArrayOutputStream outputStream = new ByteArrayOutputStream();
        ConverterProperties converterProperties = new ConverterProperties();
        FontProvider dfp = new DefaultFontProvider(true, false, false);
        dfp.addFont("/Library/Fonts/Arial.ttf");
        converterProperties.setFontProvider(dfp);
        converterProperties.setMediaDeviceDescription(new MediaDeviceDescription(MediaType.PRINT));
        HtmlConverter.convertToPdf(htmlSource, outputStream, converterProperties);
        byte[] bytes = outputStream.toByteArray();
        File pdfFile = new File("java19.pdf");
        FileOutputStream fos = new FileOutputStream(pdfFile);
        fos.write(bytes);
        fos.flush();
        fos.close();
    }

    private static String getContent() {
        return "<!DOCTYPE html>\n" +
                "<html lang=\"en\">\n" +
                "\n" +
                "<head>\n" +
                "    <meta charset=\"UTF-8\">\n" +
                "    <meta name=\"viewport\" content=\"width=device-width, initial-scale=1.0\">\n" +
                "    <meta http-equiv=\"X-UA-Compatible\" content=\"ie=edge\">\n" +
                "    <title>Document</title>\n" +
                "    <style>\n" +
                "      @page {\n" +
                "        margin: 0;\n" +
                "        font-family: arial;\n" +
                "      }\n" +
                "    </style>\n" +
                "</head>\n" +
                "\n" +
                "<body\n" +
                "    style=\"margin: 0;padding: 0;font-family: arial, sans-serif;font-size: 14px;line-height: 125%;width: 100%;-ms-text-size-adjust: 100%;-webkit-text-size-adjust: 100%;color: #222222;\">\n" +
                "    <table cellpadding=\"0\" cellspacing=\"0\" width=\"100%\" style=\"background: white; direction: rtl;\">\n" +
                "        <tbody>\n" +
                "            <tr>\n" +
                "                <td style=\"padding: 0 35px;\">\n" +
                "                    <p> انا اسمى عبدالله\n" +
                "                    </p>\n" +
                "                </td>\n" +
                "            </tr>\n" +
                "        </tbody>\n" +
                "    </table>\n" +
                "\n" +
                "</body>\n" +
                "\n" +
                "</html>";
    }
}

The issue you linked is 5 years old and is about iText 5. You are using iText 7 + pdfHTML, so the linked issue may not apply to you. — Amedee Van Gasse, May 19 '20 at 18:10
Please attach the resultant PDF. Are you using pdfCalligraph? — Alexey Subach, May 19 '20 at 22:40
Check this thread. https://stackoverflow.com/q/61814632/13528037 — Natsu, May 27 '20 at 19:57

JohannesB · Answer 1 · 2020-05-25T10:03:50.627

Please make sure that your source file and your compiler use the same encoding, e.g. UTF-8. I sometimes check this by including characters that exist only in Unicode and not in the legacy code pages.
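To illustrate what such a mismatch looks like, here is a minimal sketch using only the JDK. The class name `EncodingMismatchDemo` and its helpers are made up for this example. It encodes the Arabic sentence from the question as UTF-8 and decodes the bytes as ISO-8859-1, which is roughly what happens when a UTF-8 source file is compiled with a different `-encoding`:

```java
import java.nio.charset.StandardCharsets;

final class EncodingMismatchDemo {

    // The Arabic sentence from the question, written with escapes so this file
    // compiles identically under any source encoding.
    static final String ARABIC =
            "\u0627\u0646\u0627 \u0627\u0633\u0645\u0649 \u0639\u0628\u062F\u0627\u0644\u0644\u0647";

    // Simulates the mismatch: UTF-8 bytes interpreted as ISO-8859-1.
    static String misdecode(String text) {
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, StandardCharsets.ISO_8859_1);
    }

    // True if the string contains at least one character from the Arabic block.
    static boolean containsArabic(String text) {
        return text.codePoints()
                .anyMatch(cp -> Character.UnicodeBlock.of(cp) == Character.UnicodeBlock.ARABIC);
    }

    public static void main(String[] args) {
        String garbled = misdecode(ARABIC);
        System.out.println("Original contains Arabic: " + containsArabic(ARABIC));
        System.out.println("Garbled contains Arabic:  " + containsArabic(garbled));
        System.out.println("Length before/after:      " + ARABIC.length() + " / " + garbled.length());
    }
}
```

If the string that reaches the converter looks like the garbled variant (no Arabic characters left, and about twice as long), the problem is in the encoding and not in iText.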

I tried to reproduce the issue and I got the following warning in the logging when running the example code:

Cannot find pdfCalligraph module, which was implicitly required by one of the layout properties

This was already mentioned by Alexey Subach and can cause the following issues:

Problems with text direction (I am no expert on Arabic but the text was aligned to the right)
Wrong combination of characters (For the details see this document: https://itextpdf.com/sites/default/files/2018-12/iText_pdfCalligraph_4pager.pdf )

This is the output I got without pdfCalligraph:

pdf result without calligraph

Created with the codebase on this repository

So, to render Arabic the way your browser renders the HTML, you will also need:

A commercial license for https://itextpdf.com/en/products/itext-7/pdfcalligraph
Code to load the license file (or you will get a LicenseFileNotLoadedException)
This dependency: https://repo.itextsupport.com/releases/com/itextpdf/typography/2.0.6/

Your question is tagged iText 7, but depending on your requirements there may be free alternatives. Apache FOP, for example, should work with Arabic ligatures according to this source, but it would probably require rework because it is based on XSL-FO. In theory, you could generate the XSL-FO with whatever templating mechanism you currently use (e.g. JSP, JSF, or Thymeleaf) and use something like a servlet filter to convert the XSL-FO to a PDF on the fly during a request in a web application.

rhens · Accepted Answer · 2020-05-28T17:33:48.360

It's difficult to determine what the issue is exactly without seeing the faulty output. But your "random content" sounds like an encoding issue.

Since you have your Arabic content directly in your source code, you have to be careful about encoding. For example, using ISO-8859-1, the resulting PDF output is:

Using Unicode escape sequences (\uXXXX), you can indeed avoid some of these encoding issues. Replacing

"                    <p> انا اسمى عبدالله\n" +

with

"                    <p>\u0627\u0646\u0627 \u0627\u0633\u0645\u0649 \u0639\u0628\u062F\u0627\u0644\u0644\u0647" +

results in Arabic glyphs, even when using ISO-8859-1 encoding. Alternatively, you can use UTF-8 to get the correct content regardless of the use of Unicode escape sequences.
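Typing such escape sequences by hand is error-prone, so they can be generated. The following is a small sketch using only the JDK; the class `UnicodeEscaper` and the method `toJavaEscapes` are names invented for this example. It replaces every non-ASCII character with its `\uXXXX` form and leaves ASCII untouched (it does not escape quotes or backslashes, so it is only meant for converting the Arabic text itself):

```java
final class UnicodeEscaper {

    // Replaces every non-ASCII UTF-16 code unit with a backslash-u escape
    // (four uppercase hex digits); ASCII characters are copied unchanged.
    static String toJavaEscapes(String text) {
        StringBuilder sb = new StringBuilder();
        for (char c : text.toCharArray()) {
            if (c < 128) {
                sb.append(c);
            } else {
                sb.append(String.format("\\u%04X", (int) c));
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Pass the text to convert as command-line arguments.
        for (String arg : args) {
            System.out.println(toJavaEscapes(arg));
        }
    }
}
```

The printed result can be pasted into a Java string literal. Note that passing Arabic on the command line depends on the terminal encoding, so reading the text from a UTF-8 file may be more reliable.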

When your encoding issues are solved, you will likely get output like this:

For correct rendering of certain writing systems, an optional module pdfCalligraph is needed for iText 7. With this module enabled, the resulting output looks like this:

The code used for the tests above:

public static void main(String[] args) throws IOException {
    // Needed for pdfCalligraph
    LicenseKey.loadLicenseFile("all-products.xml");

    File pdfFile = new File("java19.pdf");
    OutputStream outputStream = new FileOutputStream(pdfFile);
    String htmlSource = getContent();
    ConverterProperties converterProperties = new ConverterProperties();
    FontProvider dfp = new DefaultFontProvider(true, false, false);
    dfp.addFont("/Library/Fonts/Arial.ttf");
    converterProperties.setFontProvider(dfp);
    converterProperties.setMediaDeviceDescription(new MediaDeviceDescription(MediaType.PRINT));
    HtmlConverter.convertToPdf(htmlSource, outputStream, converterProperties);
}

private static String getContent() {
    return "<!DOCTYPE html>\n" +
            "<html lang=\"en\">\n" +
            "\n" +
            "<head>\n" +
            "    <meta charset=\"UTF-8\">\n" +
            "    <meta name=\"viewport\" content=\"width=device-width, initial-scale=1.0\">\n" +
            "    <meta http-equiv=\"X-UA-Compatible\" content=\"ie=edge\">\n" +
            "    <title>Document</title>\n" +
            "    <style>\n" +
            "      @page {\n" +
            "        margin: 0;\n" +
            "        font-family: arial;\n" +
            "      }\n" +
            "    </style>\n" +
            "</head>\n" +
            "\n" +
            "<body\n" +
            "    style=\"margin: 0;padding: 0;font-family: arial, sans-serif;font-size: 14px;line-height: 125%;width: 100%;-ms-text-size-adjust: 100%;-webkit-text-size-adjust: 100%;color: #222222;\">\n" +
            "    <table cellpadding=\"0\" cellspacing=\"0\" width=\"100%\" style=\"background: white; direction: rtl;\">\n" +
            "        <tbody>\n" +
            "            <tr>\n" +
            "                <td style=\"padding: 0 35px;\">\n" +
// Arabic content
//            "                    <p> انا اسمى عبدالله\n" +
// Arabic content with Unicode escape sequences
            "                    <p>\u0627\u0646\u0627 \u0627\u0633\u0645\u0649 \u0639\u0628\u062F\u0627\u0644\u0644\u0647" +
            "                    </p>\n" +
            "                </td>\n" +
            "            </tr>\n" +
            "        </tbody>\n" +
            "    </table>\n" +
            "\n" +
            "</body>\n" +
            "\n" +
            "</html>";
}
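As a side note, the JDK can tell you whether a string needs right-to-left handling at all, which helps separate encoding problems from layout problems. The sketch below uses `java.text.Bidi` from the standard library; it is only a diagnostic, is not part of iText, and does not change how the PDF is rendered. The class name `BidiCheck` is made up for this example:

```java
import java.text.Bidi;

final class BidiCheck {

    // True if the text contains characters that need bidirectional analysis,
    // i.e. it cannot simply be laid out left to right in storage order.
    static boolean requiresBidi(String text) {
        char[] chars = text.toCharArray();
        return Bidi.requiresBidi(chars, 0, chars.length);
    }

    public static void main(String[] args) {
        String arabic =
                "\u0627\u0646\u0627 \u0627\u0633\u0645\u0649 \u0639\u0628\u062F\u0627\u0644\u0644\u0647";
        System.out.println("Arabic sentence requires bidi: " + requiresBidi(arabic));
        System.out.println("\"Document\" requires bidi:      " + requiresBidi("Document"));
    }
}
```

If the check reports that the string requires bidi handling, the characters arrived intact, and any remaining problem such as reversed or unjoined letters lies in the layout step.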

When I use Unicode escape sequences like \u0627\u0646, they are printed as-is in the PDF. — Jitender, May 28 '20 at 12:03
Are you using `\u0627` and not `%u0627` which you mentioned in your question? — rhens, May 28 '20 at 12:27
The Arabic text gets printed, but the output is wrong. I put انا اسمى عبدالله in the code in Unicode format as you suggested, but it got printed as هللادبع ىمسا انا. I think the text direction is wrong. The Unicode characters I used: `\u0627\u0646\u0627 \u0627\u0633\u0645\u0649 \u0639\u0628\u062F\u0627\u0644\u0644\u0647` — Jitender, May 28 '20 at 13:43
Is my last screenshot what you expect? Are you using pdfCalligraph? — rhens, May 28 '20 at 14:58
Thanks. If possible, can you tell me how to use it in the given code? I don't have prior experience in Java. — Jitender, May 28 '20 at 15:40
I tested with your original code without a lot of modifications. Mostly just took out the Spring related parts. But I'll add it to my answer for completeness' sake. — rhens, May 28 '20 at 17:16

JohannesB · Answer 3 · score 0 · answered May 23 '20 at 07:31

Make sure your fonts support the characters you need. If you use a Maven resource directory to include extra fonts during the build, check that the font file is not filtered (properties replacement), as that corrupts the file: Maven corrupting binary files in source/main/resources when building jar

I use the Arial font, which supports Arabic. My PDF has the content, but it wasn't the same as the HTML. – Jitender May 23 '20 at 13:37
