I am generating a PDF file from an HTML string, but the content of the generated PDF does not match the HTML: the PDF shows what looks like random characters. I searched for the issue, and the suggestion was to use Unicode notation such as %u0627%u0646%u0627%20%u0627%u0633%u0645%u0649%20%u0639%u0628%u062F%u0627%u0644%u0644%u0647. But when I put this into my HTML, it is printed literally.

related issue: Writing Arabic in pdf using itext

package com.example.demo;

import com.itextpdf.html2pdf.ConverterProperties;
import com.itextpdf.html2pdf.HtmlConverter;
import com.itextpdf.styledxmlparser.css.media.MediaDeviceDescription;
import com.itextpdf.styledxmlparser.css.media.MediaType;
import com.itextpdf.html2pdf.resolver.font.DefaultFontProvider;
import com.itextpdf.layout.font.FontProvider;
import org.springframework.boot.SpringApplication;
import org.springframework.boot.autoconfigure.SpringBootApplication;

import java.io.ByteArrayOutputStream;
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;

@SpringBootApplication
public class DemoApplication {

    public static void main(String[] args) throws IOException {
        SpringApplication.run(DemoApplication.class, args);
        String htmlSource = getContent();
        ByteArrayOutputStream outputStream = new ByteArrayOutputStream();
        ConverterProperties converterProperties = new ConverterProperties();
        FontProvider dfp = new DefaultFontProvider(true, false, false);
        dfp.addFont("/Library/Fonts/Arial.ttf");
        converterProperties.setFontProvider(dfp);
        converterProperties.setMediaDeviceDescription(new MediaDeviceDescription(MediaType.PRINT));
        HtmlConverter.convertToPdf(htmlSource, outputStream, converterProperties);
        byte[] bytes = outputStream.toByteArray();
        File pdfFile = new File("java19.pdf");
        FileOutputStream fos = new FileOutputStream(pdfFile);
        fos.write(bytes);
        fos.flush();
        fos.close();
    }

    private static String getContent() {
        return "<!DOCTYPE html>\n" +
                "<html lang=\"en\">\n" +
                "\n" +
                "<head>\n" +
                "    <meta charset=\"UTF-8\">\n" +
                "    <meta name=\"viewport\" content=\"width=device-width, initial-scale=1.0\">\n" +
                "    <meta http-equiv=\"X-UA-Compatible\" content=\"ie=edge\">\n" +
                "    <title>Document</title>\n" +
                "    <style>\n" +
                "      @page {\n" +
                "        margin: 0;\n" +
                "        font-family: arial;\n" +
                "      }\n" +
                "    </style>\n" +
                "</head>\n" +
                "\n" +
                "<body\n" +
                "    style=\"margin: 0;padding: 0;font-family: arial, sans-serif;font-size: 14px;line-height: 125%;width: 100%;-ms-text-size-adjust: 100%;-webkit-text-size-adjust: 100%;color: #222222;\">\n" +
                "    <table cellpadding=\"0\" cellspacing=\"0\" width=\"100%\" style=\"background: white; direction: rtl;\">\n" +
                "        <tbody>\n" +
                "            <tr>\n" +
                "                <td style=\"padding: 0 35px;\">\n" +
                "                    <p> انا اسمى عبدالله\n" +
                "                    </p>\n" +
                "                </td>\n" +
                "            </tr>\n" +
                "        </tbody>\n" +
                "    </table>\n" +
                "\n" +
                "</body>\n" +
                "\n" +
                "</html>";
    }
}
Jitender

3 Answers

Please check that your source file and your compiler use the same encoding, e.g. UTF-8. I sometimes verify this by including characters that exist only in Unicode and not in the classic code pages.
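One way to run that check is to print the code points the compiler actually produced for a string literal. The following is only a minimal sketch; the class name `CodePointDump` and the method `describe` are made up for illustration. The sample string uses escapes so that the sketch itself does not depend on the file encoding.

```java
public class CodePointDump {

    /** Returns the code points of s as "U+XXXX" tokens separated by spaces. */
    public static String describe(String s) {
        StringBuilder sb = new StringBuilder();
        s.codePoints().forEach(cp -> {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("U+%04X", cp));
        });
        return sb.toString();
    }

    public static void main(String[] args) {
        // Replace this escaped sample with the literal Arabic text from your own source file.
        String sample = "\u0627\u0646\u0627";
        System.out.println(describe(sample));
    }
}
```

If you paste the Arabic literal into `sample` and the output starts with values such as U+00D8 U+00A7 instead of U+0627, the compiler most likely read a UTF-8 file as ISO-8859-1. In that case, passing `-encoding UTF-8` to `javac` (or setting `project.build.sourceEncoding` in Maven) should help.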

I tried to reproduce the issue and I got the following warning in the logging when running the example code:

Cannot find pdfCalligraph module, which was implicitly required by one of the layout properties

This was already mentioned by Alexsey Subach, and the missing module can cause rendering problems.

This is the output I got without pdfCalligraph:

[Image: PDF output without pdfCalligraph]

Created with the codebase on this repository

So to get Arabic rendered the way your browser renders the HTML, you will also need the pdfCalligraph module.

Your question is tagged iText 7, but depending on your requirements there may be free alternatives such as Apache FOP, which should handle Arabic ligatures according to this source. It would probably require rework, though, because it is based on XSL-FO. In theory, you could generate the XSL-FO with whatever templating mechanism you currently use (e.g. JSP, JSF, or Thymeleaf) and use something like a servlet filter to convert the XSL-FO to a PDF on the fly during a request (in a web application).

JohannesB

It's difficult to determine what the issue is exactly without seeing the faulty output. But your "random content" sounds like an encoding issue.

Since you have your Arabic content directly in your source code, you have to be careful about encoding. For example, using ISO-8859-1, the resulting PDF output is:

[Image: output with wrong encoding]

Using Unicode escape sequences (\uXXXX), you can indeed avoid some of these encoding issues. Replacing

"                    <p> انا اسمى عبدالله\n" +

with

"                    <p>\u0627\u0646\u0627 \u0627\u0633\u0645\u0649 \u0639\u0628\u062F\u0627\u0644\u0644\u0647\n" +

results in Arabic glyphs, even when using ISO-8859-1 encoding. Alternatively, you can use UTF-8 to get the correct content regardless of the use of Unicode escape sequences.
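Another way to sidestep the source-file encoding question is to keep the HTML out of the Java source and read it from a file with an explicit charset. This is only a sketch under assumptions: the class name `HtmlLoader` and the file name `template.html` are hypothetical, and the file is assumed to be saved as UTF-8.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class HtmlLoader {

    /** Reads the whole file as UTF-8, independent of the platform default charset. */
    public static String readUtf8(Path file) throws IOException {
        return new String(Files.readAllBytes(file), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        // "template.html" is a hypothetical file name; point this at your own HTML file.
        String htmlSource = readUtf8(Paths.get("template.html"));
        System.out.println(htmlSource.length() + " characters read");
    }
}
```

The returned string can then be passed to `HtmlConverter.convertToPdf` in place of the hard-coded `getContent()` result.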

When your encoding issues are solved, you will likely get output like this:

[Image: incorrect Arabic rendering]

For correct rendering of certain writing systems, an optional module pdfCalligraph is needed for iText 7. With this module enabled, the resulting output looks like this:

[Image: Arabic rendering output]

The code used for the tests above:

public static void main(String[] args) throws IOException {
    // Needed for pdfCalligraph
    LicenseKey.loadLicenseFile("all-products.xml");

    File pdfFile = new File("java19.pdf");
    OutputStream outputStream = new FileOutputStream(pdfFile);
    String htmlSource = getContent();
    ConverterProperties converterProperties = new ConverterProperties();
    FontProvider dfp = new DefaultFontProvider(true, false, false);
    dfp.addFont("/Library/Fonts/Arial.ttf");
    converterProperties.setFontProvider(dfp);
    converterProperties.setMediaDeviceDescription(new MediaDeviceDescription(MediaType.PRINT));
    HtmlConverter.convertToPdf(htmlSource, outputStream, converterProperties);
}

private static String getContent() {
    return "<!DOCTYPE html>\n" +
            "<html lang=\"en\">\n" +
            "\n" +
            "<head>\n" +
            "    <meta charset=\"UTF-8\">\n" +
            "    <meta name=\"viewport\" content=\"width=device-width, initial-scale=1.0\">\n" +
            "    <meta http-equiv=\"X-UA-Compatible\" content=\"ie=edge\">\n" +
            "    <title>Document</title>\n" +
            "    <style>\n" +
            "      @page {\n" +
            "        margin: 0;\n" +
            "        font-family: arial;\n" +
            "      }\n" +
            "    </style>\n" +
            "</head>\n" +
            "\n" +
            "<body\n" +
            "    style=\"margin: 0;padding: 0;font-family: arial, sans-serif;font-size: 14px;line-height: 125%;width: 100%;-ms-text-size-adjust: 100%;-webkit-text-size-adjust: 100%;color: #222222;\">\n" +
            "    <table cellpadding=\"0\" cellspacing=\"0\" width=\"100%\" style=\"background: white; direction: rtl;\">\n" +
            "        <tbody>\n" +
            "            <tr>\n" +
            "                <td style=\"padding: 0 35px;\">\n" +
// Arabic content
//            "                    <p> انا اسمى عبدالله\n" +
// Arabic content with Unicode escape sequences
            "                    <p>\u0627\u0646\u0627 \u0627\u0633\u0645\u0649 \u0639\u0628\u062F\u0627\u0644\u0644\u0647" +
            "                    </p>\n" +
            "                </td>\n" +
            "            </tr>\n" +
            "        </tbody>\n" +
            "    </table>\n" +
            "\n" +
            "</body>\n" +
            "\n" +
            "</html>";
}
rhens
  • When I use Unicode escape sequences like \u0627\u0646, they are printed as-is in the PDF. – Jitender May 28 '20 at 12:03
  • Are you using `\u0627` and not `%u0627` which you mentioned in your question? – rhens May 28 '20 at 12:27
  • The Arabic text gets printed, but the output is wrong. I put انا اسمى عبدالله in the code in Unicode format as you suggested, but it got printed as هللادبع ىمسا انا. I think the text direction is wrong. The Unicode characters I used: `\u0627\u0646\u0627 \u0627\u0633\u0645\u0649 \u0639\u0628\u062F\u0627\u0644\u0644\u0647` – Jitender May 28 '20 at 13:43
  • Is my last screenshot what you expect? Are you using pdfCalligraph? – rhens May 28 '20 at 14:58
  • Thanks. If possible, can you tell me how to use it in the given code? I don't have prior experience in Java. – Jitender May 28 '20 at 15:40
  • I tested with your original code without a lot of modifications. Mostly just took out the Spring related parts. But I'll add it to my answer for completeness' sake. – rhens May 28 '20 at 17:16
Make sure your fonts support the characters you need. If you use the Maven resources directory to include extra fonts during the build, also check that the font file is not filtered (property replacement), as that corrupts the file; see: Maven corrupting binary files in source/main/resources when building jar
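To check whether the build changed the font file, you could compare a checksum of the original with the copy that ends up in the build output. This is a rough sketch, not part of iText or Maven; the class name `FontFileCheck` is made up, and the two paths passed on the command line (for example a font under `src/main/resources` and its copy under `target/classes`) are assumptions about your project layout.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class FontFileCheck {

    /** Returns the SHA-256 digest of a file as a lowercase hex string. */
    public static String sha256(Path file) throws IOException, NoSuchAlgorithmException {
        MessageDigest digest = MessageDigest.getInstance("SHA-256");
        byte[] hash = digest.digest(Files.readAllBytes(file));
        StringBuilder hex = new StringBuilder();
        for (byte b : hash) {
            hex.append(String.format("%02x", b));
        }
        return hex.toString();
    }

    /** True if both files have identical content according to their SHA-256 digests. */
    public static boolean sameContent(Path a, Path b) throws IOException, NoSuchAlgorithmException {
        return sha256(a).equals(sha256(b));
    }

    public static void main(String[] args) throws Exception {
        if (args.length != 2) {
            System.err.println("Usage: FontFileCheck <original font> <font in build output>");
            return;
        }
        // Both paths are supplied by the caller; nothing here is specific to a project.
        boolean same = sameContent(Paths.get(args[0]), Paths.get(args[1]));
        System.out.println(same ? "Font file is unchanged" : "Font file differs (possibly filtered)");
    }
}
```

If the digests differ, the font was modified during the build, and excluding binary files from resource filtering is the usual remedy.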

JohannesB