
I'm trying to convert HTML to PDF with the pisa utility (xhtml2pdf). Please check the code below. I'm getting an error that I can't figure out.

Traceback (most recent call last):
  File "dewa.py", line 27, in <module>
    html = html.encode(enc, 'replace')
UnicodeDecodeError: 'ascii' codec can't decode byte 0xd9 in position 203: ordinal not in range(128)

Here is the code:

from cStringIO import StringIO
from grab import Grab
from grab.tools.lxml_tools import drop_node, render_html
from grab.tools.text import remove_bom
from lxml import etree
import grab.error
import inspect
import lxml
import os
import sys
import xhtml2pdf.pisa as pisa

enc = 'utf-8'
filePath = os.path.expanduser('~/Desktop/dewa')  # file() does not expand '~' by itself
##############################

g = Grab()
g.go('http://www.dewa.gov.ae/arabic/aboutus/dewahistory.aspx')

html = g.response.body

html = html.replace('bgcolor="EDF389"', 'bgcolor="#EDF389"')


''' clear page '''
html = html.encode(enc, 'replace')

print html

f = file(filePath + '.html' , 'wb')
f.write(html)
f.flush()
f.close()

''' Save PDF '''
pdfresult = StringIO()
pdf = pisa.pisaDocument(StringIO(html), pdfresult, encoding = enc)
f = file(filePath + '.pdf', 'wb')
f.write(pdfresult.getvalue())
f.flush()
f.close()
pdfresult.close()
Aruna
  • A Google search for **'ascii' codec can't decode byte** on Stack Overflow returns 12K+ results. You might want to start with that... – dda Dec 10 '12 at 12:29

1 Answer


If you check the type of object returned by this line:

html = g.response.body

you will see that it is not a unicode object:

print type(html)
...
<type 'str'>

so when you come to this line:

html = html.encode(enc, 'replace')

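you are calling encode on a byte string that is already encoded. In Python 2, str.encode first decodes the string implicitly with the default ASCII codec, and that hidden decode step is what raises the UnicodeDecodeError on the non-ASCII byte 0xd9.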
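The failure can be reproduced without downloading anything. The following is a minimal sketch, assuming a short run of UTF-8 bytes as stand-in data (the bytes and the names raw, text and error_message are illustrative, not taken from the page). It performs the ASCII decode explicitly, so it behaves the same way on Python 2 and Python 3:

```python
# UTF-8 bytes for a short Arabic word (illustrative data, not from the page).
raw = b'\xd9\x85\xd8\xb1\xd8\xad\xd8\xa8\xd8\xa7'

# Decoding with the ASCII codec fails on the first non-ASCII byte. This is the
# step Python 2 performs implicitly when .encode() is called on a byte string.
try:
    raw.decode('ascii')
    error_message = None
except UnicodeDecodeError as exc:
    error_message = str(exc)

print(error_message)

# Decoding with the codec the bytes were actually written in succeeds.
text = raw.decode('utf-8')
print('%d bytes, %d characters' % (len(raw), len(text)))
```

The message printed by the first print call has the same form as the one in the traceback, only with a different byte position.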

To fix this, change your code to look like this:

# decode the downloaded data
html = g.response.body.decode(enc)

# html is now a unicode object
html = html.replace('bgcolor="EDF389"', 'bgcolor="#EDF389"')

print html

# encode as utf-8 before writing to file (no need for 'replace')
html = html.encode(enc)
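The same decode, edit, encode pattern can be wrapped in a small helper so that the conversion happens exactly once in each direction. This is only a sketch: the function name fix_bgcolor and the sample markup are hypothetical, and it assumes the downloaded bytes really are UTF-8.

```python
def fix_bgcolor(raw, encoding='utf-8'):
    """Decode raw bytes, patch the bgcolor attribute, and re-encode.

    Hypothetical helper for illustration; assumes `raw` is encoded
    with `encoding`.
    """
    text = raw.decode(encoding)          # bytes -> unicode text
    text = text.replace(u'bgcolor="EDF389"', u'bgcolor="#EDF389"')
    return text.encode(encoding)         # unicode text -> bytes


# Sample input containing a non-ASCII (Arabic) character as UTF-8 bytes.
sample = b'<td bgcolor="EDF389">\xd9\x85</td>'
patched = fix_bgcolor(sample)
print(repr(patched))
```

Keeping all string manipulation on the decoded text, and encoding only at the point where the data is written out, avoids the implicit ASCII conversion entirely.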
ekhumoro
  • Dear ekhumoro. Thanks for the answer. After I fixed the script as you suggested, the saved PDF/HTML files cannot be read. Please check the generated files. – Aruna Dec 11 '12 at 03:26
  • The code I gave is correct, and deals with the encoding issues. I'm guessing you have a different issue with fonts. Are you seeing lots of black rectangles in the pdf file? If so, [this question](http://stackoverflow.com/q/4047095/984421) may help. – ekhumoro Dec 11 '12 at 18:33
  • Dear ekhumoro, thanks again. Now I can see the Arabic text in the PDF, but all of the text is in reversed order. Any clue? – Aruna Dec 12 '12 at 09:01
  • @ArunaLakmal. I think I've helped as much as I can with this. Please start a new question for any other issues. – ekhumoro Dec 12 '12 at 17:26