10

I'm trying to generate a PDF that contains Arabic text using Apache PDFBox, but the text comes out as disconnected characters. PDFBox maps the given Arabic string to a sequence of general ("official") Unicode characters, which render as the isolated forms of the Arabic letters.

Here is an example:
Target text to write in the PDF (the expected output): جملة بالعربي
What I get in the PDF file:

incorrect text

I tried several methods without success. Here are some of them:
1. Converting the String to a stream of bits and trying to extract the right values
2. Treating the String as a sequence of bytes in UTF-8 and UTF-16 and extracting values from them

One approach seemed very promising: getting the Unicode value of each character. But it also yields the general ("official") code point. Here is what I mean:

System.out.println( Integer.toHexString( (int)(new String("كلمة").charAt(1))) );  

The output is 644, but fee0 was the expected output: this character is in the middle of the word, so I should get the medial-form code point fee0.

So what I want is a method that generates the correct contextual code point, not just the official one.

The leftmost column of the first table at the following link shows the general Unicode values:
Arabic Unicode Tables Wikipedia
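
For context, here is a minimal JDK-only sketch (no PDFBox or ICU involved) of how the two values relate. U+0644 is the abstract letter that a Java String stores, while U+FEE0 is a compatibility "presentation form" of that same letter, so NFKC normalization maps it back to U+0644. The class and method names are made up for illustration.

```java
import java.text.Normalizer;

final class PresentationFormDemo {
    // Maps presentation-form characters back to their base letters via NFKC normalization.
    static String toBaseLetter(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFKC);
    }

    // Hex value of the first code point, in the same style as the snippet above.
    static String hex(String s) {
        return Integer.toHexString(s.codePointAt(0));
    }

    public static void main(String[] args) {
        String medialLam = "\uFEE0"; // ARABIC LETTER LAM MEDIAL FORM
        System.out.println(hex(medialLam));               // fee0
        System.out.println(hex(toBaseLetter(medialLam))); // 644
    }
}
```

Going the other way, from U+0644 to U+FEE0, depends on the neighbouring letters, which is why a shaping step is needed rather than a per-character lookup with charAt or codePointAt.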

Mina Gerges
  • Have you tried `codePointAt`? – M. Prokhorov Jan 16 '18 at 15:50
  • @M.Prokhorov It is the same of output of charAt (644) and fee0 should be the expected output according to character position and form – Mina Gerges Jan 16 '18 at 15:55
  • [Here's something](https://github.com/w3c/alreq/wiki/Should-I-use-the-Arabic-Presentation-Forms-provided-in-Unicode%3F) to be said about char presentation forms. Apart from that, the only thing I could suggest then is to try using ICU library. Maybe that could help. – M. Prokhorov Jan 16 '18 at 16:31
  • Thanks for your help .Have a new Day ! – Mina Gerges Jan 16 '18 at 16:57
  • There was a mention on the pdfbox user mailing list (I think november or december 2017) of a person who had managed to do it. – Tilman Hausherr Jan 17 '18 at 09:23
  • @TilmanHausherr Any chance to put hands on what that person did ? – Mina Gerges Jan 17 '18 at 14:02
  • It is here https://mail-archives.apache.org/mod_mbox/pdfbox-users/201712.mbox/browser in the thread "Disconnected arabic characters". – Tilman Hausherr Jan 17 '18 at 14:07
  • It worked! Many thanks. Can you post a detailed answer, or do you want me to take care of that? Here is the line: `Writer.showText(new StringBuilder(new ArabicShaping(ArabicShaping.LETTERS_SHAPE).shape(target)).reverse().toString());` where target is the String and Writer is the PDPageContentStream – Mina Gerges Jan 17 '18 at 17:01
  • Thanks @M.Prokhorov for you too at first i thought that ICU library was for C++ – Mina Gerges Jan 17 '18 at 17:16
  • @MinaGerges please do it. – Tilman Hausherr Jan 19 '18 at 09:41
  • @TilmanHausherr It's Done – Mina Gerges Jan 19 '18 at 22:59
  • @TilmanHausherr Can you tell me how to make PDPageContentStream object text flow direction right to left instead of LTR – Mina Gerges Feb 06 '18 at 18:13
  • There isn't AFAIK. I thought that the solution I pointed to made this all appear nicely? – Tilman Hausherr Feb 06 '18 at 18:18
  • yea solution was perfect but i want when i call showtext method text flows from right to left because i having hard time trying make Arabic lines start at the same vertical line – Mina Gerges Feb 06 '18 at 18:39
  • @TilmanHausherr is there any accurate way to measure string width ? – Mina Gerges Feb 06 '18 at 22:05
  • @MinaGerges *"The bounty expires in 6 days. Answers to this question are eligible for a +100 reputation bounty. Mina Gerges wants to reward an existing answer."* - I don't understand this bounty. The only answer is by yourself after all... – mkl Feb 18 '20 at 11:45
  • I was playing around the bounty system to understand how it works. so when i pressed the final button i expected a modal to appear saying "are you sure ?" And Here We Are .. – Mina Gerges Feb 18 '20 at 13:05

2 Answers

9

Notice:

The sample code in this answer might be outdated; please refer to h q's answer for working sample code.


First, I would like to thank Tilman Hausherr and M. Prokhorov for pointing me to the library that made writing Arabic with Apache PDFBox possible.

This answer is divided into two sections:
  1. Downloading the library and installing it
  2. How to use the library

Downloading the library and installing it

We are going to use ICU Library.
ICU stands for International Components for Unicode and it is a mature, widely used set of C/C++ and Java libraries providing Unicode and Globalization support for software applications. ICU is widely portable and gives applications the same results on all platforms and between C/C++ and Java software.

To download the library, go to the ICU downloads page.
Choose the latest version of ICU4J, as shown in the following image.
Downloads Page
You will be taken to another page with a box of direct links to the needed components. Download the three files highlighted in the next image:

  1. icu4j-docs.jar
  2. icu4j-src.jar
  3. icu4j.jar

Files

The following steps create the library and add it to a project in the NetBeans IDE:

  1. Navigate to the toolbar and click Tools
  2. Choose Libraries
  3. At the bottom left you will find the New Library button; use it to create your library
  4. Navigate to the library that you created in the libraries list
  5. Click it and add the jar files as follows
  6. Add icu4j.jar to the Classpath
  7. Add icu4j-src.jar to Sources
  8. Add icu4j-docs.jar to Javadoc
  9. View your opened projects from the very right
  10. Expand the project in which you want to use the library
  11. Right-click the Libraries folder and choose Add Library
  12. Finally, choose the library that you have just created.

Now you are ready to use the library. Just import what you need, like this:

import com.ibm.icu.What_You_Want_To_Import;
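
Before importing anything, it can help to confirm that the jar really ended up on the classpath. The following is a small sketch of such a check using only the JDK; the class and method names are invented for illustration, and `com.ibm.icu.text.ArabicShaping` is simply the class used later in this answer.

```java
final class ClasspathCheck {
    // Returns true if the named class can be loaded from the current classpath.
    static boolean isOnClasspath(String className) {
        try {
            Class.forName(className);
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println("ICU4J ArabicShaping available: "
                + isOnClasspath("com.ibm.icu.text.ArabicShaping"));
    }
}
```

If this prints false, the library was most likely added to a different project or only to Sources or Javadoc rather than to the Classpath.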


How to use the library

With the ArabicShaping class and by reversing the String, we can write a correctly connected Arabic line.
Here is the code; note the comments in it.

import com.ibm.icu.text.ArabicShaping;
import com.ibm.icu.text.ArabicShapingException;
import java.io.File;
import java.io.IOException;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.PDPage;
import org.apache.pdfbox.pdmodel.PDPageContentStream;
import org.apache.pdfbox.pdmodel.font.*;

public class Main {
    public static void main(String[] args) throws IOException , ArabicShapingException
{
        File f = new File("Arabic Font File of format.ttf");
        PDDocument doc = new PDDocument();
        PDPage Page = new PDPage();
        doc.addPage(Page);
        PDPageContentStream Writer = new PDPageContentStream(doc, Page);
        Writer.beginText();
        Writer.setFont(PDType0Font.load(doc, f), 20);
        Writer.newLineAtOffset(0, 700);
        // The trick is in the showText call below, but first a few notes:
        // We have to reverse the string because PDFBox writes from the left, but Arabic is an RTL language.
        // The output will be correct, except that every line will be left-justified (not hard to resolve).
        // So we have to write the Arabic string to the PDF line by line, like this:
        String s ="جملة بالعربي لتجربة الكلاس اللذي يساعد علي وصل الحروف بشكل صحيح";
        // Shape the letters first; shape() throws ArabicShapingException.
        String shaped = new ArabicShaping(ArabicShaping.LETTERS_SHAPE).shape(s);
        // Pre-reverse each number (reverseNumbersInString is defined in the update below),
        // then reverse the whole line so that it reads correctly when drawn left to right.
        Writer.showText(new StringBuilder(reverseNumbersInString(shaped)).reverse().toString());
        Writer.endText();
        Writer.close();
        doc.save(new File("File_Test.pdf"));
        doc.close();
    }
}

Here is the output

Output

I hope I have covered everything.
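
The left-justification mentioned in the code comments can be handled by computing where each line has to start so that it ends at the right margin. A minimal sketch of that arithmetic follows. It assumes the string width is supplied in thousandths of a text-space unit, which is the unit PDFBox's `PDFont.getStringWidth(String)` returns; the class, method, and parameter names are hypothetical, and 612 is the width of a US Letter page in points.

```java
final class RightAlign {
    /**
     * Computes the x offset at which a line must start so that it ends at the right margin.
     *
     * @param pageWidth        page width in points (612 for US Letter)
     * @param rightMargin      distance from the right page edge in points
     * @param stringWidthUnits string width in 1/1000 text-space units
     * @param fontSize         font size in points
     */
    static float startX(float pageWidth, float rightMargin, float stringWidthUnits, float fontSize) {
        float textWidth = stringWidthUnits / 1000f * fontSize; // width of the line in points
        return pageWidth - rightMargin - textWidth;
    }

    public static void main(String[] args) {
        // A line measuring 10000 units at 20 pt is 200 pt wide.
        System.out.println(startX(612f, 50f, 10000f, 20f)); // 362.0
    }
}
```

The result would be passed as the first argument of `newLineAtOffset` for each line, so that all lines share the same right edge.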

Update: After reversing the line, make sure to reverse the numbers again so that they read correctly.
Here are a couple of functions that could help:

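public static boolean isInt(String Input)
{
    try{Integer.parseInt(Input);return true;}
    catch(NumberFormatException e){return false;}
}
// Reverses every numeric run (digits plus embedded '.' or '-') in place,
// so that reversing the whole line afterwards leaves the numbers readable.
public static String reverseNumbersInString(String Input)
{
    char[] Separated = Input.toCharArray();
    StringBuilder Result = new StringBuilder();
    int i = 0;
    while(i < Separated.length)
    {
        if(isInt(Separated[i]+""))
        {
            StringBuilder Hold = new StringBuilder();
            while(i < Separated.length && (isInt(Separated[i]+"") ||  Separated[i] == '.' ||  Separated[i] == '-'))
            {
                Hold.append(Separated[i]);
                i++;
            }
            // i now points at the first character after the number, so it is not advanced again here
            Result.append(Hold.reverse());
        }
        else
        {
            Result.append(Separated[i]);
            i++;
        }
    }
    return Result.toString();
}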
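
To see why the numbers need this extra step, here is a self-contained sketch of the same idea using only the JDK. It uses Latin placeholder text so the effect is easy to read in a console, and `Character.isDigit` instead of `isInt`; the class and method names are illustrative, not part of PDFBox or ICU.

```java
final class NumberRunDemo {
    // Reverses each run of digits (with embedded '.' or '-') in place.
    static String reverseNumberRuns(String input) {
        StringBuilder result = new StringBuilder();
        int i = 0;
        while (i < input.length()) {
            if (Character.isDigit(input.charAt(i))) {
                int start = i;
                while (i < input.length()
                        && (Character.isDigit(input.charAt(i))
                            || input.charAt(i) == '.' || input.charAt(i) == '-')) {
                    i++;
                }
                result.append(new StringBuilder(input.substring(start, i)).reverse());
            } else {
                result.append(input.charAt(i));
                i++;
            }
        }
        return result.toString();
    }

    // Pre-reverses the numbers, then reverses the whole line.
    static String toVisualOrder(String line) {
        return new StringBuilder(reverseNumberRuns(line)).reverse().toString();
    }

    public static void main(String[] args) {
        String line = "abc 12.5 def";
        System.out.println(new StringBuilder(line).reverse()); // fed 5.21 cba
        System.out.println(toVisualOrder(line));               // fed 12.5 cba
    }
}
```

A plain reverse turns 12.5 into 5.21, while reversing the numeric runs first keeps the number intact in the final line.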
Mina Gerges
  • for those who build with maven... the segment in the pom.xml would be `<dependency><groupId>com.ibm.icu</groupId><artifactId>icu4j</artifactId><version>60.2</version></dependency>`, to see the formatting see here http://mvnrepository.com/artifact/com.ibm.icu/icu4j/60.2 . – Tilman Hausherr Jan 20 '18 at 06:20
  • And here is some explanation about maven and why you should use it https://stackoverflow.com/questions/13335351/what-does-maven-do-in-theory-and-in-practice-when-is-it-worth-to-use-it – Mina Gerges Jan 20 '18 at 07:51
  • If you want a library that does all this for you (bidi, shaping, reversing, etc.) try OpenHTMLtoPDF. https://github.com/danfickle/openhtmltopdf – Daniel F Mar 29 '18 at 14:51
  • Works fine in case of new pdfs, when I am trying to update the pdf, it doesn't work. Here is an example https://stackoverflow.com/questions/55451551/unable-ot-save-arabic-words-in-a-pdf-pdfbox-java – Danyal Sandeelo Apr 01 '19 at 13:05
  • i am still novice with library but what about using the content writer in the append mode ? – Mina Gerges Apr 02 '19 at 12:34
5

Here is code that works. Download a sample font, e.g. trado.ttf.

EDIT: I have since been using the Amiri font, which can be downloaded from the aliftype/amiri GitHub repository.

Make sure the pdfbox-app and icu4j jar files are in your classpath.

import java.io.File;
import java.io.IOException;

import com.ibm.icu.text.ArabicShaping;
import com.ibm.icu.text.ArabicShapingException;
import com.ibm.icu.text.Bidi;

import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.PDPage;
import org.apache.pdfbox.pdmodel.PDPageContentStream;
import org.apache.pdfbox.pdmodel.font.*;

public class Main {
    public static void main(String[] args) throws IOException , ArabicShapingException
    {
        File f = new File("Amiri-Regular.ttf");
        PDDocument doc = new PDDocument();
        PDPage Page = new PDPage();
        doc.addPage(Page);
        PDPageContentStream Writer = new PDPageContentStream(doc, Page);
        Writer.beginText();
        Writer.setFont(PDType0Font.load(doc, f), 20);
        Writer.newLineAtOffset(0, 700);
        String s ="جملة بالعربي لتجربة الكلاس اللذي يساعد علي وصل الحروف بشكل صحيح";
        Writer.showText(bidiReorder(s));
        Writer.endText();
        Writer.close();
        doc.save(new File("File_Test.pdf"));
        doc.close();
    }

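    private static String bidiReorder(String text)
    {
        try {
            // Shape the letters into their contextual forms, then let ICU's Bidi put them in visual order.
            Bidi bidi = new Bidi((new ArabicShaping(ArabicShaping.LETTERS_SHAPE)).shape(text), Bidi.DIRECTION_DEFAULT_RIGHT_TO_LEFT); // constant value 127
            bidi.setReorderingMode(Bidi.REORDER_DEFAULT); // constant value 0
            return bidi.writeReordered(Bidi.DO_MIRRORING); // constant value 2
        }
        catch (ArabicShapingException ase3) {
            // Fall back to the unshaped text if shaping fails.
            return text;
        }
    }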
    
}
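
To illustrate what the bidi step contributes beyond a plain reverse, here is a sketch that uses the JDK's own `java.text.Bidi` instead of ICU4J's `com.ibm.icu.text.Bidi`. This is an assumption made to keep the example dependency-free: it performs no letter shaping and no mirroring, works per `char`, and is not a replacement for the ICU code above. The class and method names are invented for illustration.

```java
import java.text.Bidi;

final class JdkBidiDemo {
    // Reorders a logical-order string into visual (left-to-right drawing) order.
    static String toVisualOrder(String logical) {
        // Base direction comes from the first strong character, defaulting to RTL.
        Bidi bidi = new Bidi(logical, Bidi.DIRECTION_DEFAULT_RIGHT_TO_LEFT);
        int n = logical.length();
        byte[] levels = new byte[n];
        Character[] chars = new Character[n];
        for (int i = 0; i < n; i++) {
            levels[i] = (byte) bidi.getLevelAt(i); // resolved embedding level of each char
            chars[i] = logical.charAt(i);
        }
        // Rearranges the chars array in place according to the resolved levels.
        Bidi.reorderVisually(levels, 0, chars, 0, n);
        StringBuilder visual = new StringBuilder(n);
        for (Character c : chars) {
            visual.append(c);
        }
        return visual.toString();
    }

    public static void main(String[] args) {
        // Arabic word followed by a number, in logical (typing) order.
        String logical = "\u0639\u0631\u0628\u064A 123";
        String visual = toVisualOrder(logical);
        System.out.println(visual.startsWith("123")); // true
    }
}
```

The digits keep their left-to-right order while the Arabic letters are reversed around them, which is the behaviour the manual number-reversing workaround in the other answer approximates.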
h q