
I want to write Persian text to a PDF with PDFBox. However, PDFBox separates the characters and reverses their order. How can I fix that? (I am using pdfbox-2.0.19.)

Here is my code:

                 PDDocument doc = new PDDocument();
                 PDPage page = new PDPage(PDRectangle.A4);
                 doc.addPage(page);
                 PDPageContentStream cont = new PDPageContentStream(doc, page);
                 cont.beginText();
                 PDType0Font font = PDType0Font.load(doc, new File("C:\\Users\\farhad\\Downloads\\vazir-font-v24.1.0\\Vazir.ttf"));
                 int fontSize = 15;
                 cont.setFont(font, fontSize);
                 String text = "برنامه نویس ایرانی هست اینو بلد باشه؟";
                 float x =  50;
                 float y = page.getMediaBox().getHeight() / 2;
                 cont.newLineAtOffset(x, y);
                 cont.showText(text);
                 cont.endText();
                 cont.close();
                 doc.save("D:\\pdf.pdf");
                 doc.close();

I have attached a screenshot that shows the result:

[screenshot: rendered PDF output]

farhad
  • You're printing it left to right. What happens if you reverse the string before printing it? – cup Mar 30 '20 at 21:35
  • @cup Nothing changes. again it separates characters and reverses them – farhad Mar 30 '20 at 21:58
  • Does it print properly if you use printstream? see https://stackoverflow.com/questions/16644247/print-arabic-or-other-charset-in-system-out Just trying to isolate whether it is the pdf driver or whether you need to render before printing. – cup Mar 31 '20 at 04:34
  • Complex scripts are not supported. https://issues.apache.org/jira/browse/PDFBOX-4189 – Tilman Hausherr Mar 31 '20 at 05:24
  • @cup I did some manipulation with Unicode characters and fixed the issue ;) – farhad Apr 02 '20 at 17:13
  • Post your solution and mark it as the answer. The problem with the character set is there are 4 representations of each letter depending on what's to the left or right of it. There is a unicode character set with all 4 forms. To render, you'd normally check left and right and choose the correct one. That way all the characters will join up. – cup Apr 02 '20 at 19:46
  • @cup exactly ;) – farhad Apr 03 '20 at 11:05

1 Answer


For those of you looking for the answer to this question:

You need to do some manipulation with Unicode characters. Every Persian letter as you normally know it (for example س ش ت ظ) can actually take up to four different forms, and each form has its own Unicode code point:

  • initial form
  • medial form
  • final form
  • isolated form

Take the word سا.

Here the initial س has a different Unicode code point from the س that comes at the end of the word راس.
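As a small illustration of that point (a sketch, not part of the original solution): the two forms of س are distinct strings in Java, yet both map back to the same base letter under compatibility normalization. The class name `SeenFormsDemo` and the helper `baseLetter` are made up for this example; `java.text.Normalizer` is part of the standard library.

```java
import java.text.Normalizer;

final class SeenFormsDemo {
    // Presentation-form code points for the letter seen (base letter U+0633).
    static final String INITIAL_SEEN = "\uFEB3";
    static final String ISOLATED_SEEN = "\uFEB1";

    // Maps a presentation form back to its base letter via NFKC normalization.
    static String baseLetter(String presentationForm) {
        return Normalizer.normalize(presentationForm, Normalizer.Form.NFKC);
    }

    public static void main(String[] args) {
        // The two forms are different characters...
        System.out.println(INITIAL_SEEN.equals(ISOLATED_SEEN));
        // ...but both normalize to the same base letter.
        System.out.println(baseLetter(INITIAL_SEEN).equals(baseLetter(ISOLATED_SEEN)));
    }
}
```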

For a better understanding, look at the picture below:

[image: Unicode entry for the letter س]

How can you get the Unicode code point of each form?

Go to https://www.compart.com/en/unicode/ and search for your character.

Note that Arabic and Persian use almost the same script. That is why the picture I uploaded says "Arabic Letter Seen Isolated Form" for the س character.
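If you would rather check the names from code than on a website, the JDK can report the Unicode name of a code point through `Character.getName`. The following is only a sketch; the class `FormNames` and the method `describe` are names invented for the example.

```java
final class FormNames {
    // Formats a code point together with its official Unicode name.
    static String describe(int codePoint) {
        return String.format("U+%04X %s", codePoint, Character.getName(codePoint));
    }

    public static void main(String[] args) {
        // The four presentation forms of seen occupy U+FEB1 through U+FEB4.
        for (int cp = 0xFEB1; cp <= 0xFEB4; cp++) {
            System.out.println(describe(cp));
        }
    }
}
```

The names still say "ARABIC LETTER", for the reason given above: the Persian letters live in the Arabic blocks of Unicode.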

Here is a class that you can use to look up the four forms of a Persian character (a form that does not exist for a letter is returned as "\0"):

 public class PersianCharachtersUnicode {

    char c;
    private String InitialFom_Unicode;
    private String MedialForm_Unicode;
    private String FinalForm_Unicode;
    private String IsolatedForm_Unicode;


    public void setCharc (char c) {
        this.c = c;
        calculate();
    }





    private void calculate() {

        switch (c) {

        case 'آ':

            InitialFom_Unicode    = "\0";
            MedialForm_Unicode    = "\0";
            FinalForm_Unicode     = "\uFE82";
            IsolatedForm_Unicode  = "\uFE81";
            break;

        case 'ا':

            InitialFom_Unicode    = "\0";
            MedialForm_Unicode    = "\0";
            FinalForm_Unicode     = "\uFE8E";
            IsolatedForm_Unicode  = "\uFE8D";
            break;


        case 'ب':

            InitialFom_Unicode    = "\uFE91";
            MedialForm_Unicode    = "\uFE92";
            FinalForm_Unicode     = "\uFE90";
            IsolatedForm_Unicode  = "\uFE8F";
            break;


        case 'پ':

            InitialFom_Unicode    = "\uFB58";
            MedialForm_Unicode    = "\uFB59";
            FinalForm_Unicode     = "\uFB57";
            IsolatedForm_Unicode  = "\uFB56";
            break;


        case 'ت':

            InitialFom_Unicode    = "\uFE97";
            MedialForm_Unicode    = "\uFE98";
            FinalForm_Unicode     = "\uFE96";
            IsolatedForm_Unicode  = "\uFE95";
            break;


        case 'ث':

            InitialFom_Unicode    = "\uFE9B";
            MedialForm_Unicode    = "\uFE9C";
            FinalForm_Unicode     = "\uFE9A";
            IsolatedForm_Unicode  = "\uFE99";
            break;


        case 'ج':

            InitialFom_Unicode    = "\uFE9F";
            MedialForm_Unicode    = "\uFEA0";
            FinalForm_Unicode     = "\uFE9E";
            IsolatedForm_Unicode  = "\uFE9D";
            break;


        case 'چ':

            InitialFom_Unicode    = "\uFB7C";
            MedialForm_Unicode    = "\uFB7D";
            FinalForm_Unicode     = "\uFB7B";
            IsolatedForm_Unicode  = "\uFB7A";
            break;


        case 'ح':

            InitialFom_Unicode    = "\uFEA3";
            MedialForm_Unicode    = "\uFEA4";
            FinalForm_Unicode     = "\uFEA2";
            IsolatedForm_Unicode  = "\uFEA1";
            break;


        case 'خ':

            InitialFom_Unicode    = "\uFEA7";
            MedialForm_Unicode    = "\uFEA8";
            FinalForm_Unicode     = "\uFEA6";
            IsolatedForm_Unicode  = "\uFEA5";
            break;


        case 'د':

            InitialFom_Unicode    = "\0";
            MedialForm_Unicode    = "\0";
            FinalForm_Unicode     = "\uFEAA";
            IsolatedForm_Unicode  = "\uFEA9";
            break;


        case 'ذ':

            InitialFom_Unicode    = "\0";
            MedialForm_Unicode    = "\0";
            FinalForm_Unicode     = "\uFEAC";
            IsolatedForm_Unicode  = "\uFEAB";
            break;


        case 'ر':

            InitialFom_Unicode    = "\0";
            MedialForm_Unicode    = "\0";
            FinalForm_Unicode     = "\uFEAE";
            IsolatedForm_Unicode  = "\uFEAD";
            break;


        case 'ز':

            InitialFom_Unicode    = "\0";
            MedialForm_Unicode    = "\0";
            FinalForm_Unicode     = "\uFEB0";
            IsolatedForm_Unicode  = "\uFEAF";
            break;


        case 'ژ':

            InitialFom_Unicode    = "\0";
            MedialForm_Unicode    = "\0";
            FinalForm_Unicode     = "\uFB8B";
            IsolatedForm_Unicode  = "\uFB8A";
            break;


        case 'س':

            InitialFom_Unicode    = "\uFEB3";
            MedialForm_Unicode    = "\uFEB4";
            FinalForm_Unicode     = "\uFEB2";
            IsolatedForm_Unicode  = "\uFEB1";
            break;


        case 'ش':

            InitialFom_Unicode    = "\uFEB7";
            MedialForm_Unicode    = "\uFEB8";
            FinalForm_Unicode     = "\uFEB6";
            IsolatedForm_Unicode  = "\uFEB5";
            break;


        case 'ص':

            InitialFom_Unicode    = "\uFEBB";
            MedialForm_Unicode    = "\uFEBC";
            FinalForm_Unicode     = "\uFEBA";
            IsolatedForm_Unicode  = "\uFEB9";
            break;


        case 'ض':

            InitialFom_Unicode    = "\uFEBF";
            MedialForm_Unicode    = "\uFEC0";
            FinalForm_Unicode     = "\uFEBE";
            IsolatedForm_Unicode  = "\uFEBD";
            break;


        case 'ط':

            InitialFom_Unicode    = "\uFEC3";
            MedialForm_Unicode    = "\uFEC4";
            FinalForm_Unicode     = "\uFEC2";
            IsolatedForm_Unicode  = "\uFEC1";
            break;


        case 'ظ':

            InitialFom_Unicode    = "\uFEC7";
            MedialForm_Unicode    = "\uFEC8";
            FinalForm_Unicode     = "\uFEC6";
            IsolatedForm_Unicode  = "\uFEC5";
            break;


        case 'ع':

            InitialFom_Unicode    = "\uFECB";
            MedialForm_Unicode    = "\uFECC";
            FinalForm_Unicode     = "\uFECA";
            IsolatedForm_Unicode  = "\uFEC9";
            break;


        case 'غ':

            InitialFom_Unicode    = "\uFECF";
            MedialForm_Unicode    = "\uFED0";
            FinalForm_Unicode     = "\uFECE";
            IsolatedForm_Unicode  = "\uFECD";
            break;


        case 'ف':

            InitialFom_Unicode    = "\uFED3";
            MedialForm_Unicode    = "\uFED4";
            FinalForm_Unicode     = "\uFED2";
            IsolatedForm_Unicode  = "\uFED1";
            break;


        case 'ق':

            InitialFom_Unicode    = "\uFED7";
            MedialForm_Unicode    = "\uFED8";
            FinalForm_Unicode     = "\uFED6";
            IsolatedForm_Unicode  = "\uFED5";
            break;


        case 'ک':

            InitialFom_Unicode    = "\uFB90";
            MedialForm_Unicode    = "\uFB91";
            FinalForm_Unicode     = "\uFB8F";
            IsolatedForm_Unicode  = "\uFB8E";
            break;


        case 'گ':

            InitialFom_Unicode    = "\uFB94";
            MedialForm_Unicode    = "\uFB95";
            FinalForm_Unicode     = "\uFB93";
            IsolatedForm_Unicode  = "\uFB92";
            break;


        case 'ل':

            InitialFom_Unicode    = "\uFEDF";
            MedialForm_Unicode    = "\uFEE0";
            FinalForm_Unicode     = "\uFEDE";
            IsolatedForm_Unicode  = "\uFEDD";
            break;


        case 'م':

            InitialFom_Unicode    = "\uFEE3";
            MedialForm_Unicode    = "\uFEE4";
            FinalForm_Unicode     = "\uFEE2";
            IsolatedForm_Unicode  = "\uFEE1";
            break;


        case 'ن':

            InitialFom_Unicode    = "\uFEE7";
            MedialForm_Unicode    = "\uFEE8";
            FinalForm_Unicode     = "\uFEE6";
            IsolatedForm_Unicode  = "\uFEE5";
            break;


        case 'و':

            InitialFom_Unicode    = "\0";
            MedialForm_Unicode    = "\0";
            FinalForm_Unicode     = "\uFEEE";
            IsolatedForm_Unicode  = "\uFEED";
            break;


        case 'ه':

            InitialFom_Unicode    = "\uFEEB";
            MedialForm_Unicode    = "\uFEEC";
            FinalForm_Unicode     = "\uFEEA";
            IsolatedForm_Unicode  = "\uFEE9";
            break;


        case 'ی':

            InitialFom_Unicode    = "\uFBFE";
            MedialForm_Unicode    = "\uFBFF";
            FinalForm_Unicode     = "\uFBFD";
            IsolatedForm_Unicode  = "\uFBFC";
            break;


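        default:
            // Not in the table: clear all four forms so that values from a
            // previous call do not leak through.
            InitialFom_Unicode    = "\0";
            MedialForm_Unicode    = "\0";
            FinalForm_Unicode     = "\0";
            IsolatedForm_Unicode  = "\0";
            break;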
        }

    }




    /**
     * @return the initialFom_Unicode
     */
    public String getInitialFom_Unicode() {
        return InitialFom_Unicode;
    }

    /**
     * @return the finalForm_Unicode
     */
    public String getFinalForm_Unicode() {
        return FinalForm_Unicode;
    }

    /**
     * @return the isolatedForm_Unicode
     */
    public String getIsolatedForm_Unicode() {
        return IsolatedForm_Unicode;
    }

    /**
     * @return the medialForm_Unicode
     */
    public String getMedialForm_Unicode() {
        return MedialForm_Unicode;
    }
 }
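The class above only looks up the forms; picking the right one still depends on the neighbouring letters, as cup's comment on the question describes. Below is a minimal sketch of that selection step, under several assumptions: the table holds just four letters to keep it short, ligatures such as lam-alef and diacritics are ignored, and the final reversal is only valid for a single line of purely right-to-left text. The names `PersianShaperSketch`, `shape` and `toVisualOrder` are hypothetical and not part of PDFBox.

```java
import java.util.HashMap;
import java.util.Map;

final class PersianShaperSketch {
    // Forms per letter in the order: isolated, final, initial, medial.
    // Letters that never connect to the following letter list only two forms.
    private static final Map<Character, String> FORMS = new HashMap<>();
    static {
        FORMS.put('\u0627', "\uFE8D\uFE8E");             // alef
        FORMS.put('\u0631', "\uFEAD\uFEAE");             // reh
        FORMS.put('\u0628', "\uFE8F\uFE90\uFE91\uFE92"); // beh
        FORMS.put('\u0633', "\uFEB1\uFEB2\uFEB3\uFEB4"); // seen
    }

    // True if the letter can connect to the letter that follows it.
    private static boolean connectsForward(char c) {
        String forms = FORMS.get(c);
        return forms != null && forms.length() == 4;
    }

    // Replaces each known letter by the form that matches its neighbours.
    // Input and output are both in logical (typing) order.
    static String shape(String logical) {
        StringBuilder out = new StringBuilder(logical.length());
        for (int i = 0; i < logical.length(); i++) {
            char c = logical.charAt(i);
            String forms = FORMS.get(c);
            if (forms == null) {       // spaces, digits, unknown letters
                out.append(c);
                continue;
            }
            boolean joinsPrev = i > 0 && connectsForward(logical.charAt(i - 1));
            boolean joinsNext = forms.length() == 4
                    && i + 1 < logical.length()
                    && FORMS.containsKey(logical.charAt(i + 1));
            int index;
            if (joinsPrev && joinsNext) index = 3;  // medial
            else if (joinsPrev)         index = 1;  // final
            else if (joinsNext)         index = 2;  // initial
            else                        index = 0;  // isolated
            out.append(forms.charAt(index));
        }
        return out.toString();
    }

    // Reverses the shaped text so that a left-to-right writer draws it
    // in the right visual order.
    static String toVisualOrder(String shaped) {
        return new StringBuilder(shaped).reverse().toString();
    }

    public static void main(String[] args) {
        String visual = toVisualOrder(shape("\u0633\u0627"));
        for (char c : visual.toCharArray()) {
            System.out.printf("U+%04X%n", (int) c);
        }
    }
}
```

With a full table, the string returned by `toVisualOrder(shape(text))` is what you would pass to `cont.showText(...)` in the question's code, provided the font contains glyphs for the presentation forms. Note that in راس the letter ا does not connect to the following letter, so the sketch selects the isolated form for the trailing س.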
farhad
  • [ICU4J](http://site.icu-project.org/home) can help you doing those replacements. See also [this q&a](https://stackoverflow.com/a/48346903/1729265) dealing with the same issue for Arabic writing. – mkl Apr 03 '20 at 13:10
  • @mkl ICU4J uses a class named "ArabicShaping". There are some differences between the Arabic and Persian scripts; for example, these characters exist in Persian but not in Arabic: گ چ پ ژ. Unfortunately, ArabicShaping doesn't know how to deal with Persian characters. – farhad Apr 03 '20 at 13:14
  • The same problem exists in Urdu and Iraqi – cup Apr 03 '20 at 14:43