urllib.urlretrieve encoding is not kept

Question

I'm using Python 3.4.

When I use `urllib.request.urlretrieve(link, filename="file.html")` on a UTF-8 page, the resulting file.html is not properly encoded. How do I make sure the file is saved as UTF-8? How would I apply `.decode('utf-8')` in this case?

EDIT

This is the original part of page:

« Écoute, mon peuple, je parle ;
Moi, Dieu, je suis ton Dieu !
Je ne t'accuse pas pour tes sacrifices ;
tes holocaustes sont toujours devant moi.

« Je ne prendrai pas un seul taureau de ton domaine,
pas un bélier de tes enclos.
Tout le gibier des forêts m'appartient
et le bétail des hauts pâturages.

« Si j'ai faim, irai-je te le dire ?
Le monde et sa richesse m'appartiennent.
Vais-je manger la chair des taureaux
et boire le sang des béliers ?

« Qu'as-tu à réciter mes lois,
à garder mon alliance à la bouche,
toi qui n'aimes pas les reproches
et rejettes loin de toi mes paroles ? »

And this is what I get in the saved file:

� �coute, mon peuple, je parle ;�
Moi, Dieu, je suis ton Dieu !�
Je ne t'accuse pas pour tes sacrifices ;
tes holocaustes sont toujours devant moi.�

� Je ne prendrai pas un seul taureau de ton domaine,
pas un b�lier de tes enclos.�
Tout le gibier des for�ts m'appartient
et le b�tail des hauts p�turages.

� Si j'ai faim, irai-je te le dire ?
Le monde et sa richesse m'appartiennent.�
Vais-je manger la chair des taureaux
et boire le sang des b�liers ?�

� Qu'as-tu � r�citer mes lois,�
� garder mon alliance � la bouche,�
toi qui n'aimes pas les reproches
et rejettes loin de toi mes paroles ?��

I noticed that in certain parts of the page the accented characters are not actually UTF-8 encoded, yet the browser displays them properly. For example, instead of É there is an HTML entity such as `&Eacute;`, and this seems to cause problems when the file is downloaded.

I don't think `urlretrieve` re-encodes anything. Can you give an example? — Lev Levitsky, Jun 28 '14 at 09:16
It's called HTML escaping, not encoding. See http://stackoverflow.com/q/2360598/1258041 and http://stackoverflow.com/q/2087370/1258041 — Lev Levitsky, Jun 28 '14 at 09:56
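To illustrate the distinction made in the comment above, here is a minimal sketch of HTML unescaping with the standard library's `html.unescape` (available from Python 3.4). The sample string is an assumption modelled on the excerpt in the question, not the actual page source.

```python
import html

# Hypothetical sample: text in which accented letters are written as HTML entities.
escaped = "&Eacute;coute, mon peuple ; pas un b&eacute;lier de tes enclos."

# html.unescape turns named and numeric character references into real characters.
unescaped = html.unescape(escaped)
print(unescaped)

# Encoding is a separate step: it only happens when the text is turned into bytes.
utf8_bytes = unescaped.encode("utf-8")
```

Escaping is about how characters are spelled inside the markup; encoding is about how the resulting text is stored as bytes.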

score 1 · Accepted Answer · edited May 23 '17 at 11:49

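You can unescape the HTML escape sequences in the file line by line with the standard library's `html.unescape`. Pass `encoding='utf-8'` to `open` so the output does not fall back to the platform default encoding.

import html
import urllib.request

# `link` is the URL being downloaded, as in the question.
# html.unescape (Python 3.4+) replaces HTMLParser().unescape, which was removed in Python 3.9.
with urllib.request.urlopen(link) as fin, open(
        "file.html", "w", encoding="utf-8") as fout:
    for line in fin:
        # Decode the raw bytes, then convert entities such as &Eacute; to real characters.
        fout.write(html.unescape(line.decode("utf-8")))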

answered Jun 28 '14 at 10:05 · Lev Levitsky

I tried this code and it does something but not exactly what I needed. Indeed the text is unescaped but the resulting file is saved in windows-1252 encoding, even the lines in the original that were not escaped. – To Do Jun 28 '14 at 12:11
@ToDo Try specifying `encoding` in the call to `open` as shown in the edited answer. – Lev Levitsky Jun 28 '14 at 12:19
That did the trick. I must figure out how to create a function to use this code for each link I download in this script. – To Do Jun 28 '14 at 12:23
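One possible way to wrap the accepted approach in a reusable function, as the last comment asks. This is a sketch only: the names `unescape_lines` and `save_unescaped` are hypothetical, and it assumes every page really is UTF-8 unless another `encoding` is passed.

```python
import html
import urllib.request


def unescape_lines(byte_lines, encoding="utf-8"):
    """Yield each raw line decoded to text, with HTML entities unescaped."""
    for raw in byte_lines:
        yield html.unescape(raw.decode(encoding))


def save_unescaped(link, filename, encoding="utf-8"):
    """Download `link` and write it to `filename` as UTF-8 text."""
    # The explicit encoding avoids the platform default (e.g. windows-1252);
    # newline="" keeps the page's own line endings untouched.
    with urllib.request.urlopen(link) as fin, open(
            filename, "w", encoding="utf-8", newline="") as fout:
        fout.writelines(unescape_lines(fin, encoding))
```

It could then be called once per link in the script, for example `save_unescaped(link, "file.html")`.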

score 0 · Answer 2 · answered Jun 28 '14 at 10:30

I advise using BeautifulSoup to handle this for you: it implicitly converts the loaded document to Unicode and writes it back out as UTF-8.

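from bs4 import BeautifulSoup

# In Python 3 the raw UTF-8 markup must be a bytes literal.
markup = b"<h1>Sacr\xc3\xa9 bleu!</h1>"
soup = BeautifulSoup(markup, "html.parser")
soup.h1
# <h1>Sacré bleu!</h1>
soup.h1.string
# 'Sacré bleu!'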

See the BeautifulSoup documentation for details.
