10

I am having a Unicode issue with the contents of a variable when writing to a PDF with Python.

It's outputting this error:

UnicodeEncodeError: 'latin-1' codec can't encode character '\u2013'

Basically, it is getting caught on a dash character (strictly speaking, '\u2013' is an en dash rather than an em dash).

I have tried taking the variable whose contents include the dash and redefining it with `.encode('utf-8')`, as shown below:

Body = msg.Body

BodyC = Body.encode('utf-8')

And now I get the below error:

Traceback (most recent call last):
  File "script.py", line 37, in <module>
    pdf.cell(200, 10, txt="Bod: " + BodyC,  ln=4, align="C")
TypeError: can only concatenate str (not "bytes") to str

Below is my full code. How can I simply fix the Unicode error caused by the contents of the 'Body' variable?

I am open to converting to UTF-8 or a Western encoding, anything other than 'latin-1'. Any suggestions?

Full Code:

from fpdf import FPDF
import win32com.client

outlook = win32com.client.Dispatch("Outlook.Application").GetNamespace("MAPI")
msg = outlook.OpenSharedItem(r"C:\User\language\python\Msg-To-PDF\test_msg.msg")

print (msg.SenderName)
print (msg.SenderEmailAddress)
print (msg.SentOn)
print (msg.To)
print (msg.CC)
print (msg.BCC)
print (msg.Subject)
print (msg.Body)

SenderName = msg.SenderName
SenderEmailAddress = msg.SenderEmailAddress
SentOn = msg.SentOn
To = msg.To
CC = msg.CC
BCC = msg.BCC
Subject = msg.Subject
Body = msg.Body
BodyC = Body.encode('utf-8')

pdf = FPDF()
pdf.add_page()

# pdf.add_font('DejaVu', '', 'DejaVuSansCondensed.ttf', uni=True)
pdf.set_font("Helvetica", style = '', size = 11)
pdf.cell(200, 10, txt="From: " + SenderName, ln=1, align="C")
# pdf.cell(200, 10, border=SentOn, ln=1, align="C")
pdf.cell(200, 10, txt="To: " + To, ln=1, align="C")
pdf.cell(200, 10, txt="CC: " + CC, ln=1, align="C")
pdf.cell(200, 10, txt="BCC: " + BCC, ln=1, align="C")
pdf.cell(200, 10, txt="Subject: " + Subject, ln=1, align="C")
pdf.cell(200, 10, txt="Bod: " + BodyC,  ln=4, align="C")

pdf.output("Sample.pdf")
  • How can I switch away from 'latin-1'?

  • Is there any way to fix these issues globally?

Dr Upvote
  • 1
    Have you tried casting the msg.Body with `str(msg.Body)`? – ladygremlin Jun 25 '19 at 20:21
  • Where, what do you mean? – Dr Upvote Jun 25 '19 at 20:23
  • `Body = msg.Body` -> `Body = str(msg.Body)` ? – ladygremlin Jun 25 '19 at 20:24
  • 1
    It still produces the exact same error 'UnicodeEncodeError: 'latin-1' codec can't encode character '\u2013' in position 485: ordinal not in range(256)' – Dr Upvote Jun 25 '19 at 20:26
  • Try this answer: https://stackoverflow.com/questions/6539881/python-converting-from-iso-8859-1-latin1-to-utf-8 – ladygremlin Jun 25 '19 at 20:33
  • `BodyC = Body.encode('utf-8')` actually does nothing! Another point: the `\u2013` in the error output is Unicode, but the system-wide encoding is not set properly. Some caveats: which default encoding does the calling process use? Most encoding errors are thrown by non-raw file/IO objects. @ladygremlin Windows always raises these errors; I solved it by setting the system-wide encoding to UTF-8 (not Unicode). – dsgdfg Jun 25 '19 at 20:37
  • 1
    @dsgdfg Ahhh, I didn't realize Windows always throws this. That's not my OS of choice. :) – ladygremlin Jun 25 '19 at 20:38
  • In Python IDLE, `'\x64\x45'+'teest'` gives `'dEteest'` because I used `python2.7.X`; if you use `python3.x`, convert the bytes to a string using the source encoding. – dsgdfg Jun 25 '19 at 20:46
  • @dsgdfg Any suggestions? – Dr Upvote Jun 26 '19 at 14:31
  • Possible duplicate of [Python : UnicodeEncodeError: 'latin-1' codec can't encode character](https://stackoverflow.com/questions/8290206/python-unicodeencodeerror-latin-1-codec-cant-encode-character) – phuclv Jun 27 '19 at 00:18
  • [UnicodeEncodeError: 'latin-1' codec can't encode character](https://stackoverflow.com/q/3942888/995714) – phuclv Jun 27 '19 at 00:18
  • @phuclv so I fixed this specific error; but how can I globally handle these issues? – Dr Upvote Jun 27 '19 at 13:13

4 Answers

21

A workaround is to convert all text to latin-1 encoding before passing it on to the library. You can do that with the following command:

text2 = text.encode('latin-1', 'replace').decode('latin-1')

text2 will be free of any non-latin-1 characters. However, some chars may be replaced with ?
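
To cut down on the number of `?` substitutions, one option is to map common typographic characters to plain equivalents first and only then apply the `replace` fallback. The following is a minimal sketch of that idea; the helper name `to_latin1` and the mapping table are hypothetical and not part of FPDF.

```python
# Hypothetical mapping: common typographic characters -> plain stand-ins.
PUNCTUATION = {
    0x2013: "-",    # en dash
    0x2014: "-",    # em dash
    0x2018: "'",    # left single quotation mark
    0x2019: "'",    # right single quotation mark
    0x201C: '"',    # left double quotation mark
    0x201D: '"',    # right double quotation mark
    0x2026: "...",  # horizontal ellipsis
}


def to_latin1(text):
    """Return text containing only characters that latin-1 can encode."""
    # First swap known punctuation for readable equivalents.
    text = text.translate(PUNCTUATION)
    # Anything still outside latin-1 becomes '?'.
    return text.encode("latin-1", "replace").decode("latin-1")


print(to_latin1("Mon\u2013Fri, caf\u00e9, \u4f60\u597d"))
# prints: Mon-Fri, café, ??
```

In the question's script this would be applied to each field before it is passed to `pdf.cell()`, for example `txt="Bod: " + to_latin1(Body)`.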

Erik Kalkoken
5

The reason for this error is that you are trying to render a character in your PDF that is outside the range of the latin-1 encoding (code points 0 to 255). FPDF uses latin-1 as the default encoding for all its built-in fonts.
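
To see which characters in a string trigger the error, you can list everything above code point 255. This is a small diagnostic sketch using only the standard library; the function name `non_latin1_chars` and the sample string are made up for illustration.

```python
import unicodedata


def non_latin1_chars(text):
    """List (index, code point, Unicode name) for characters latin-1 cannot encode."""
    return [
        (i, f"U+{ord(ch):04X}", unicodedata.name(ch, "UNKNOWN"))
        for i, ch in enumerate(text)
        if ord(ch) > 0xFF  # latin-1 only covers code points 0..255
    ]


sample = "Q1 \u2013 Q2 results"
print(non_latin1_chars(sample))
# prints: [(3, 'U+2013', 'EN DASH')]
```

The index matches the "position" reported in the `UnicodeEncodeError` message when the whole string is encoded at once.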

So as a workaround you can just remove all characters from your text that do not fit into latin-1 encoding. (see my other answer for this workaround).

To fix this error and be able to render those characters in your PDF, you need to use fonts that support a wider range of characters. For this purpose the FPDF library supports Unicode fonts.

For example, you could get the free Google Noto fonts, which cover a wide range of Unicode code points. For most Western languages I would recommend the NotoSans font set, but you can also get fonts for many other languages and scripts, including Chinese, Hebrew and Arabic.

Here is how to enable the Unicode fonts in your code for FPDF:

First you need to tell FPDF library where it can find the font files. In this example I am setting it to the sub-folder fonts of the current folder.

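import os
import fpdf
# Some fpdf versions have no set_global; in that case pass the font path to add_font(fname=...) instead.
fpdf.set_global("SYSTEM_TTFONTS", os.path.join(os.path.dirname(__file__),'fonts'))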

Then you need to add the fonts to your PDF document. In this example I am adding the NotoSans fonts for the styles normal, bold, italic and bold-italic:

pdf = fpdf.FPDF()
pdf.add_font("NotoSans", style="", fname="NotoSans-Regular.ttf", uni=True)
pdf.add_font("NotoSans", style="B", fname="NotoSans-Bold.ttf", uni=True)
pdf.add_font("NotoSans", style="I", fname="NotoSans-Italic.ttf", uni=True)
pdf.add_font("NotoSans", style="BI", fname="NotoSans-BoldItalic.ttf", uni=True)

Now you can use the new fonts normally in your PDF document with set_font(). Here is an example for normal text:

pdf.set_font("NotoSans", size=12)
Erik Kalkoken
  • I tried this solution and got the error `AttributeError: module 'fpdf' has no attribute 'set_global'` at the `fpdf.set_global(...)` line. Is a specific version of fpdf recommended? I skipped `set_global` and passed a relative path to `pdf.add_font(...)` instead, and it works: `pdf.add_font("NotoKufiArabic", style="", fname="./fonts/NotoKufiArabic-Regular.ttf", uni=True)` – akarahman Sep 08 '21 at 07:01
1

You can also change the encoding through the .set_doc_option() method (documentation here). I tried Erik's method, which worked for me, but then after adding some more complexities (such as a second PDF and using the write_html() method which required creating a new class), I went back to having the same error. Changing the encoding for the whole document should solve the overall problem as you said.

The readthedocs page says you can only use latin-1 or windows-1252, but pdf.set_doc_option('core_fonts_encoding', 'utf-8') worked for me according to the debugger. Just be aware that some characters will need fixing, like the apostrophe (') showing as â€ÂTM in the PDF.
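
A likely explanation for the garbled apostrophe (an assumption on my part, not checked against FPDF's source) is that the text is written as UTF-8 bytes but the built-in fonts still interpret each byte as a single Western character. The sketch below reproduces that effect in plain Python for the "curly" apostrophe U+2019; a straight ASCII apostrophe would not be affected.

```python
apostrophe = "\u2019"  # right single quotation mark (the "curly" apostrophe)

# UTF-8 stores this character as three bytes: E2 80 99.
utf8_bytes = apostrophe.encode("utf-8")

# Reading those bytes one at a time as windows-1252 yields three characters.
misread = utf8_bytes.decode("cp1252")
print(misread)
# prints: â€™

# Reversing the two steps recovers the original character.
repaired = misread.encode("cp1252").decode("utf-8")
print(repaired == apostrophe)
# prints: True
```

If that is what happens here, the setting changes how the bytes are written but not which glyphs the core fonts can draw, so a Unicode font is still needed for characters outside latin-1.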

Hope this is the global fix for this issue you were looking for, even if several months late!

0

I tried Erik's solution with some changes, and it works great with a mix of English and Arabic text. Sample code to generate a PDF using pyFPDF is posted below.

from datetime import datetime
def getFileName():
    now=datetime.now()
    time = now.strftime('%d_%H_%M_%S')
    filename = "Test_"+time + ".pdf"
    return filename


from fpdf import FPDF

pdf = FPDF()

#Download NotoSansArabic-Regular.ttf from Google noto fonts
pdf.add_font("NotoSansArabic", style="", fname="./fonts/NotoSansArabic-Regular.ttf", uni=True)


pdf.add_page()

pdf.set_font('Arial', '', 12)
pdf.write(8, 'Hello World')
pdf.ln(8)

# مرحبا Marhaba in arabic 
pdf.set_font('NotoSansArabic', '', 12)
text = 'مرحبا'
pdf.write(8, text)
pdf.ln(8)

pdf.output(getFileName(), 'F')
akarahman