2

I get the errors shown below when I try to scrape some text containing Finnish names from a URL. The solutions I tried, and the corresponding errors, are shown as comments in the code. I don't know how to fix these, or what the exact issue is. I'm a beginner in Python. Any help is appreciated.

My Code:

from lxml import html
import requests

page = requests.get('url')

site = page.text  # ERROR -> 'charmap' codec can't encode character '\x84' in  
      #  position {x}: character maps to <undefined>
# site = site.encode('utf-8', errors='replace')  # ERROR -> can't concat str to bytes
# site = site.encode('ascii', errors='replace')  # ERROR -> can't concat str to bytes

with open('url.txt', 'a') as file:
    try:
        file.write(site + '\n')
    except Exception as err:
        file.write('an ERROR occurred: ' + str(err) + '\n')

and the original Exception:

Traceback (most recent call last):
  File "...\parse.py", line 12, in <module>
    file.write(site + '\n')
  File "...\python36\lib\encodings\cp1252.py", line 19, in encode
    return codecs.charmap_encode(input,self.errors,encoding_table)[0]
UnicodeEncodeError: 'charmap' codec can't encode character '\x84' in position 12591: character maps to <undefined>

regards, Dominik

Oryon
  • Is the exception actually happening on the `page.text` line, or somewhere else? Please post the actual exception, not just a loose description of it. (If it's happening on the `file.write`, you will have to temporarily remove the `try`/`except` to do that.) – abarnert Jul 08 '18 at 09:05
  • Possible duplicate of [UnicodeEncodeError: 'charmap' codec can't encode characters](https://stackoverflow.com/questions/27092833/unicodeencodeerror-charmap-codec-cant-encode-characters) – Tristo Jul 08 '18 at 09:10
  • @abarnert: the original error message was: `File "...\parse.py", line 12, in <module> file.write(site + '\n') File "...\python36\lib\encodings\cp1252.py", line 19, in encode return codecs.charmap_encode(input,self.errors,encoding_table)[0] UnicodeEncodeError: 'charmap' codec can't encode character '\x84' in position 12591: character maps to <undefined>` @Tristo: I did not find any solution in the "possible duplicate". – Oryon Jul 08 '18 at 10:25

3 Answers

3

Try this instead

with open('url.txt', 'a', encoding='utf-8') as file:
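
To see why this helps, here is a minimal sketch. It assumes the default file encoding on the asker's machine is cp1252 (as the traceback suggests), and it uses a made-up sample string and a temporary file as stand-ins for the real page text and url.txt:

```python
import os
import tempfile

# Hypothetical sample: a Finnish name plus the character from the traceback.
text = "Järvenpää \x84"

path = os.path.join(tempfile.mkdtemp(), "url.txt")

# Passing cp1252 explicitly mimics the default that open() picked on Windows.
# cp1252 has no byte for the character U+0084, so the write fails.
try:
    with open(path, "a", encoding="cp1252") as file:
        file.write(text + "\n")
    cp1252_ok = True
except UnicodeEncodeError:
    cp1252_ok = False

# UTF-8 can encode every Unicode character, so the same write succeeds.
with open(path, "a", encoding="utf-8") as file:
    file.write(text + "\n")

with open(path, encoding="utf-8") as file:
    round_trip = file.read()

print(cp1252_ok)
print(round_trip == text + "\n")
```

Without an explicit encoding argument, open() falls back to a platform-dependent default, which is why the same script can work on one machine and fail on another.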
Tristo
  • No problem. To show anyone reading this in the future with a similar problem that it is solved, you can mark the thread as 'answered' by [accepting the answer](https://stackoverflow.com/help/someone-answers) – Tristo Jul 08 '18 at 11:28
2

If the exception is happening on page.text, as you indicate:

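When you ask a requests response for its text, it decodes the body using the encoding that the page claims to be in. If the page is wrong, the decoding goes wrong: a strict decode raises a UnicodeDecodeError, while page.text itself substitutes replacement characters for any bytes it cannot decode.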

For debugging problems like this, you should definitely print out what encoding requests got from the server:

print(page.encoding)

Browsers will usually just display mojibake. Sometimes, they'll even realize that the encoding is wrong and try to guess the right one. They'll rarely fail and refuse to display anything. That makes sense for something designed to display data immediately. It doesn't make sense for many programs designed to process data, or to store data for later (where you want to know there's a problem ASAP, not after you've stored 500GB of useless garbage), etc. That's why requests doesn't try too hard to do magic.

If you know the encoding is, say, Latin-6/ISO-8859-10 even though it claims to be something else, you can decode it manually:

site = page.content.decode('iso-8859-10')
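
As a rough illustration of what a wrong encoding claim does, here is a small sketch. The `content` bytes are a made-up stand-in for `page.content`, assuming the server really sent ISO-8859-10 while claiming UTF-8:

```python
# Hypothetical stand-in for page.content: text encoded as ISO-8859-10.
content = "Järvenpää".encode("iso-8859-10")

# A strict decode with the wrong encoding raises instead of guessing.
try:
    content.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False

# With errors="replace" nothing is raised, but the text is damaged.
replaced = content.decode("utf-8", errors="replace")

# Decoding with the encoding the bytes are really in recovers the text.
site = content.decode("iso-8859-10")

print(utf8_ok)
print(replaced)
print(site)
```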

If you don't know, you could use a library like chardet or Unicode, Dammit to do the same kind of guessing a browser does.
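
If you would rather not add a dependency, a much cruder option is a fallback chain that tries a few candidate encodings in order. To be clear, this is not what chardet or Unicode, Dammit do (they use statistical detection); it is only a sketch, and the function name and candidate list here are assumptions for illustration:

```python
def decode_with_fallbacks(data, candidates=("utf-8", "cp1252", "latin-1")):
    """Return (text, encoding) for the first candidate that decodes cleanly.

    Hypothetical helper: it only checks that the bytes are *valid* in an
    encoding, not that the result is the text the author intended.
    """
    for encoding in candidates:
        try:
            return data.decode(encoding), encoding
        except UnicodeDecodeError:
            continue
    # Only reached if the candidate list lacks a catch-all such as latin-1.
    return data.decode("utf-8", errors="replace"), "utf-8 (with replacement)"


print(decode_with_fallbacks("Järvenpää".encode("utf-8")))
print(decode_with_fallbacks(b"J\xe4rvenp\xe4\xe4"))
```

The order matters: UTF-8 goes first because it rejects most non-UTF-8 input, while latin-1 accepts any byte sequence and so only makes sense as the last resort.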

If you want to force it to just decode to something that you can later write back out in the same way, even if it's going to look like garbage in the meantime, you can use the surrogateescape error handler:

site = page.content.decode('utf-8', 'surrogateescape')
# ...
with open('url.txt', 'a', encoding='utf-8', errors='surrogateescape') as file:
    file.write(site + '\n')
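
A quick way to convince yourself that this round-trips is the sketch below. The `content` bytes are a made-up stand-in for `page.content`, deliberately not valid UTF-8, and a temporary file takes the place of url.txt:

```python
import os
import tempfile

# Hypothetical stand-in for page.content: bytes that are not valid UTF-8.
content = b"J\xe4rvenp\xe4\xe4 \x84"

# Each undecodable byte becomes a lone surrogate code point instead of raising.
site = content.decode("utf-8", "surrogateescape")

path = os.path.join(tempfile.mkdtemp(), "url.txt")
with open(path, "a", encoding="utf-8", errors="surrogateescape") as file:
    file.write(site + "\n")

# Reading the file back as raw bytes shows the original bytes were restored.
with open(path, "rb") as file:
    written = file.read()

print(written == content + os.linesep.encode())
```

Note that the intermediate string contains lone surrogates, so writing or printing it without the same error handler will raise a UnicodeEncodeError.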

However, if you're not actually doing anything with the contents, it's probably easier to just keep it as bytes:

site = page.content
# ...
with open('url.txt', 'ab') as file:
    file.write(site + b'\n')

Notice the 'ab' instead of 'a', and the b'\n' instead of '\n'. If you're leaving bytes as bytes, or encoding strings to bytes, you can't write them to text files, only to binary files, and you can't add them to strings, only to other bytes. Those seem to be some of the problems you ran into with some of your fix attempts.
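
The sketch below reproduces both mistakes and then the working version. The sample string and the temporary file are stand-ins for the real page and url.txt:

```python
import os
import tempfile

# encode() returns bytes, not str (hypothetical sample text).
site = "Järvenpää".encode("utf-8")

# Mixing bytes and str raises the "can't concat" TypeError from the question.
try:
    site + "\n"
    concat_error = None
except TypeError as err:
    concat_error = err

path = os.path.join(tempfile.mkdtemp(), "url.txt")

# A file opened in text mode ('a') only accepts str, so bytes are rejected.
try:
    with open(path, "a") as file:
        file.write(site)
    text_mode_error = None
except TypeError as err:
    text_mode_error = err

# Binary mode ('ab') plus a bytes newline works.
with open(path, "ab") as file:
    file.write(site + b"\n")

with open(path, "rb") as file:
    written = file.read()

print(concat_error)
print(text_mode_error)
print(written == site + b"\n")
```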

abarnert
1

I think it happens because of the Unicode transformation.

1. Add the following line to the top of your .py file:

# -*- coding: utf-8 -*-

OR 2. Use the str.encode('utf8') function

e.g. `site = site.encode('utf8')`
Prashant Godhani