import bs4 as bs
import urllib.request
from colorama import Fore, Back, Style, init

init()

# keywords and newurls are populated elsewhere in the script
def highlight(word):
    if word in keywords:
        return Fore.RED + word + Fore.RESET
    return word

for newurl in newurls:
    url = urllib.request.urlopen(newurl)
    soup1 = bs.BeautifulSoup(url, 'lxml')
    paragraphs = soup1.find_all('p')
    print(Fore.GREEN + soup1.h2.text + Fore.RESET)
    print('')
    for paragraph in paragraphs:
        if paragraph is not None:
            textpara = paragraph.text.strip().split(' ')
            colored_words = list(map(highlight, textpara))
            print(" ".join(colored_words).encode("utf-8"))

I will have a list of keywords and URLs to go through. After running a few keywords against one URL, I get output like this:

b'\x1b[31mthe desired \x1b[31mmystery corners \x1b[31mthe differential . 
\x1b[31mthe back \x1b[31mpretends to be \x1b[31mthe'

If I remove encode("utf-8"), I get an encoding error:

Traceback (most recent call last):
  File "C:\Users\resea\Desktop\Python Projects\Try 3.py", line 52, in <module>
    print(" ".join(colored_words)) #encode("utf-8")
  File "C:\Python34\lib\site-packages\colorama\ansitowin32.py", line 41, in write
    self.__convertor.write(text)
  File "C:\Python34\lib\site-packages\colorama\ansitowin32.py", line 162, in write
    self.write_and_convert(text)
  File "C:\Python34\lib\site-packages\colorama\ansitowin32.py", line 190, in write_and_convert
    self.write_plain_text(text, cursor, len(text))
  File "C:\Python34\lib\site-packages\colorama\ansitowin32.py", line 195, in write_plain_text
    self.wrapped.write(text[start:end])
  File "C:\Python34\lib\encodings\cp850.py", line 19, in encode
    return codecs.charmap_encode(input,self.errors,encoding_map)[0]
UnicodeEncodeError: 'charmap' codec can't encode character '\u2019' in position 23: character maps to <undefined>

Where am I going wrong?
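For context, the traceback can be reproduced without colorama or BeautifulSoup at all; a minimal sketch, assuming only that the console codepage is the cp850 shown in the traceback, which has no mapping for U+2019:

```python
# Minimal reproduction: cp850 (the Windows console codepage in the
# traceback above) cannot represent U+2019, the right single
# quotation mark, so encoding it raises UnicodeEncodeError.
try:
    "\u2019".encode("cp850")
except UnicodeEncodeError as exc:
    print(exc)  # 'charmap' codec can't encode character ...
```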

halfer

1 Answer


I know what I'm going to suggest is more of a workaround than a "solution", but I've been frustrated again and again by all sorts of strange characters that had to be dealt with via "encode this" or "encode that", sometimes successfully and many times not.

Depending on the type of text at your newurl, the universe of problematic characters is probably limited, so I deal with them on a case-by-case basis. Every time I get one of these errors, I do this:

import unicodedata
unicodedata.name('\u2019')

In your case, you'll get this:

'RIGHT SINGLE QUOTATION MARK'
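To speed up that case-by-case hunt, a small helper (a hypothetical addition, not part of the code above) can name every non-ASCII character in a piece of text in one pass:

```python
import unicodedata

def report_non_ascii(text):
    # Map each non-ASCII character to its Unicode name so every
    # potential offender in a paragraph is identified at once.
    return {ch: unicodedata.name(ch, "UNKNOWN")
            for ch in set(text) if ord(ch) > 127}

print(report_non_ascii("it\u2019s a test"))
```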

The old, pesky, right single quotation mark... So next, as suggested here, I just replace that pesky character with another that looks like it, but does not raise the error; in your case

colored_words = [highlight(word).replace(u"\u2019", "'") for word in textpara]  # or some other replacement character

should work (note that .replace must be applied to each string, not to the list itself). Then you rinse and repeat every time this error pops up. Admittedly not the most elegant solution, but after a while all the strange characters in your newurl sources are captured and the errors stop.
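The accumulated replacements can live in one translation table, so each newly discovered offender only needs a new entry. A sketch under that approach; the mappings below are illustrative assumptions, not a complete list:

```python
# Translation table of problematic characters found so far; extend it
# each time unicodedata.name() identifies a new offender.
REPLACEMENTS = str.maketrans({
    "\u2019": "'",   # RIGHT SINGLE QUOTATION MARK
    "\u2018": "'",   # LEFT SINGLE QUOTATION MARK
    "\u201c": '"',   # LEFT DOUBLE QUOTATION MARK
    "\u201d": '"',   # RIGHT DOUBLE QUOTATION MARK
    "\u2013": "-",   # EN DASH
})

def sanitize(text):
    # Apply all known replacements in a single pass.
    return text.translate(REPLACEMENTS)

print(sanitize("it\u2019s \u201cfine\u201d"))  # it's "fine"
```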

Jack Fleeting
  • ...or just find the right encoding in the first place and all of your problems will be solved. – Eb946207 Jan 16 '19 at 18:08
  • I tried to, until one day I ran into [this little guy](https://www.fileformat.info/info/unicode/char/0142/index.htm) which, for some reason, `.encode("utf-8")` couldn't handle, and I just gave up.... – Jack Fleeting Jan 16 '19 at 19:02
  • However, another code would work there. I don't know enough about codes to be fully sure, but try the encoding `utf-16` or `utf-32`. `utf-64` may also exist, so try that too. – Eb946207 Jan 16 '19 at 19:27
  • Thanks, will do. BTW, is it possible to apply multiple encodings simultaneously? – Jack Fleeting Jan 16 '19 at 19:30
  • No. Well, you could encode multiple times, but that would be pointless and bad. Multiple encodings is not done, **ever**. BTW, did it work? – Eb946207 Jan 16 '19 at 21:19
  • The underlying text (essentially, news stories from many countries) changes on a daily basis so encoding errors show up randomly (and disappear), depending on content every time the urls are reloaded. I'll have to wait until the next pest pops up and then try out your suggestion. Will report back when that happens. – Jack Fleeting Jan 17 '19 at 01:45