trouble using xhtml2pdf with unicode

Question

I've been trying to convert Hebrew html files without success; the Hebrew characters show up in the output PDF as black rectangles regardless of any encoding I tried.

I tried some unicode test files included in the pisa distribution: pisa-3.0.33\test\test-unicode-all.html and \test-bidirectional-text.html . I ran xhtml2pdf from the command line both with and without --encoding utf-8. Same result: none of the non-Latin characters made it through.

Is this a fonts problem*? If the unicode test file works for you, was there anything you did to set it up?

*FWIW, at least some of these languages, including Hebrew, should work with Arial.

EDIT: Alternatively, if someone has pisa set up and could try converting the unicode test file above, I would be very grateful.

Yep. I also tried with Windows-1255 HTML (and used --encoding windows-1255 in that case). — user490616, Oct 28 '10 at 22:08

score 8 · Answer 1 · answered Jan 29 '11 at 16:22

Inserting the following code into the HTML helped me:

<style>
@page {
size: a4;
margin: 0.5cm;
}

@font-face {
font-family: "Verdana";
src: url("verdana.ttf");
}

html {
font-family: Verdana;
font-size: 11pt;
}

</style>

In url(), replace "verdana.ttf" with the absolute path to the font file on your system.
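If the HTML is generated from Python, the font rule can be built with an absolute path so it does not depend on the working directory. This is only a minimal sketch: the helper name `font_face_css` and the path "fonts/verdana.ttf" are placeholders that do not come from the answer, and whether xhtml2pdf resolves the path the same way on every setup is an assumption.

```python
import os
from pathlib import Path


def font_face_css(font_file: str, family: str) -> str:
    """Return a <style> block whose @font-face rule uses an absolute font path.

    Hypothetical helper for illustration; it is not part of xhtml2pdf.
    """
    # abspath() makes the path absolute; as_posix() switches to forward
    # slashes so the value also reads cleanly inside url() on Windows.
    font_path = Path(os.path.abspath(font_file)).as_posix()
    return (
        "<style>\n"
        "@font-face {\n"
        f'  font-family: "{family}";\n'
        f'  src: url("{font_path}");\n'
        "}\n"
        f'html {{ font-family: "{family}"; font-size: 11pt; }}\n'
        "</style>"
    )


# "fonts/verdana.ttf" is a placeholder; point it at a font that exists locally.
print(font_face_css("fonts/verdana.ttf", "Verdana"))
```

The returned string can be placed in the `<head>` of the document before it is converted.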

answered by eviltrue

note that the url() path should be relative to your project root (in my experience) – Steve Jalim Jan 26 '12 at 20:03

score 3 · Answer 2 · edited May 23 '17 at 12:33

If anyone in the future tries, like me, to figure out how to PROPERLY create a PDF file that contains Hebrew using xhtml2pdf, here's what worked for me:

First, include the font settings described by @eviltrue in the answer above in your HTML. Any font works as long as it supports Hebrew characters; otherwise every Hebrew character in the input HTML appears as a black rectangle in the PDF.
Second, at the time of writing this answer, xhtml2pdf can output Hebrew characters to PDF, but it outputs them in reverse order, i.e. שלום כיתה א would come out as א התיכ םולש.
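Since the reversal only affects right-to-left text, it can help to check whether a string contains any such characters before deciding to reorder it. The sketch below uses only the standard library; the function name `contains_rtl` is made up for this illustration and is not part of xhtml2pdf or python-bidi.

```python
import unicodedata


def contains_rtl(text: str) -> bool:
    """Return True if any character has a right-to-left bidi class."""
    # Bidi class "R" covers Hebrew letters; "AL" covers Arabic letters.
    return any(unicodedata.bidirectional(ch) in ("R", "AL") for ch in text)


print(contains_rtl("Something in English"))
print(contains_rtl("Something in English - משהו בעברית"))
```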

At this point I was stuck, but then I stumbled upon this SO answer: https://stackoverflow.com/a/15449145/1918837

After installing the python-bidi package, here is an example of a complete solution (used in a python app):

from bidi import algorithm as bidialg
from xhtml2pdf import pisa

HTMLINPUT = """
            <!DOCTYPE html>
            <html>
            <head>
               <meta http-equiv="content-type" content="text/html; charset=utf-8">
               <style>
                  @page {
                      size: a4;
                      margin: 1cm;
                  }

                  @font-face {
                      font-family: DejaVu;
                      src: url(my_fonts_dir/DejaVuSans.ttf);
                  }

                  html {
                      font-family: DejaVu;
                      font-size: 11pt;
                  }
               </style>
            </head>
            <body>
               <div>Something in English - משהו בעברית</div>
            </body>
            </html>
            """

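# Open the destination in binary mode; "output.pdf" is a placeholder path.
# base_dir="L" keeps the "<" and ">" of HTML tags from being flipped
# by the bidi algorithm.
with open("output.pdf", "wb") as output_file:
    pdf = pisa.CreatePDF(bidialg.get_display(HTMLINPUT, base_dir="L"), dest=output_file)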

The nice thing about the bidi algorithm is that you can have mixed RTL and LTR languages in the same line (like in the HTML example above) and still have a correctly formatted result.
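A possible variation, not taken from this answer, is to run the reordering only on the text between tags so that the markup never passes through the bidi algorithm at all. The sketch below is an assumption-laden illustration: `reorder_text_nodes` is a hypothetical helper, the regular expression is a rough tag matcher that does not handle `>` inside attribute values, comments, or script content, and a simple string reversal stands in for `bidi.algorithm.get_display` so the example runs without extra packages.

```python
import re
from typing import Callable

# Rough tag matcher: anything from "<" to the next ">".
TAG_RE = re.compile(r"(<[^>]*>)")


def reorder_text_nodes(html: str, reorder: Callable[[str], str]) -> str:
    """Apply `reorder` to the text between tags and leave the tags untouched."""
    parts = TAG_RE.split(html)
    # Because the pattern has a capturing group, split() returns text at
    # even indexes and the matched tags at odd indexes.
    return "".join(
        part if index % 2 else reorder(part)
        for index, part in enumerate(parts)
    )


# Stand-in for get_display: plain reversal, used here for demonstration only.
print(reorder_text_nodes("<div>abc</div>", lambda text: text[::-1]))
```

With python-bidi installed, `get_display` could be passed as the `reorder` argument instead of the stand-in.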

EDIT: The best way to go now is definitely using wkhtmltopdf

How do I add a table of contents using wkhtmltopdf? Thanks! — yishairasowsky, Aug 21 '21 at 20:12
Please can you help me? — yishairasowsky, Sep 12 '21 at 11:58
