I'm downloading and parsing a web page with a Python script. I need the result encoded into 7-bit ASCII for further processing. I am using the requests library (http://docs.python-requests.org/en/master/) in a virtualenv based on whatever Python Ubuntu 16.04 LTS provides.
I would like the requests package, or some package, to handle the translation into ASCII, without requiring me to do further translation of encoded characters, because I know I am going to miss some characters. Details are as follows:
My current Python script, shown below, uses an encoding of ISO-8859-1 in an attempt to force the result data to be converted to 7-bit ASCII, with partial success. But I have to both set the result encoding and encode the text as it comes out. That seems odd, and in fact downright wrong. But even if I live with it, I have the following main issue:
Even after the encoding, I see dashes encoded in what seems to be some non-ASCII character set. It is as if the dash characters slipped through the requests encoding. The script below hacks around this by searching for the multi-byte dash encoding and replacing it with an ASCII dash character. That is not a big deal for one multi-byte character, but I suspect there are other characters that will need translating in other web pages I wish to process. Do I simply need to use some encoding other than 'ISO-8859-1' with the requests object?
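For reference, those three bytes appear to be the UTF-8 encoding of U+2013 (EN DASH), which has no 7-bit ASCII equivalent. My understanding of why they slip through: decoding them as ISO-8859-1 yields three one-byte characters, and re-encoding as ISO-8859-1 reproduces the original bytes unchanged. A minimal standalone sketch of what I believe is happening (no network access needed):

```python
raw = b'\xe2\x80\x93'  # the bytes seen on the wire

# Decoded as UTF-8, they form a single en dash character:
assert raw.decode('utf-8') == u'\u2013'

# Decoded as ISO-8859-1, they become three Latin-1 characters, and
# re-encoding as ISO-8859-1 reproduces the original bytes exactly,
# so the multi-byte sequence survives the round trip untouched:
text = raw.decode('ISO-8859-1')
assert len(text) == 3
assert text.encode('ISO-8859-1') == raw
```

If that analysis is right, picking a different single-byte encoding won't help; the dash has to be translated at the Unicode level or dropped.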
Here is my script (using Python 2.7.11 on Ubuntu 16.04 LTS on x86_64):
#!/usr/bin/env python
import sys
import os
import string
import re
import requests
url = "https://system76.com/laptops/kudu"
r = requests.get(url)
#
# Why do I have to BOTH set r.encoding AND call r.text.encode
# in order to avoid the errors?:
#
encoding = 'ISO-8859-1'
r.encoding = encoding
data = r.text.encode(encoding)
#
# Split the lines out, find the offending line,
# and translate the multi-byte characters:
#
lines = data.splitlines()
for line in lines:
    m = re.search(r'2\.6 up to 3\.5 GHz', line)
    if m:
        print "line: {}".format(line)
        # The '-' in the next line is an ASCII dash character:
        fixed_line = re.sub(r'\xe2\x80\x93', '-', line)
        print "fixed_line {}".format(fixed_line)
Invoking simple_wget.py within the virtualenv shows:
theuser@thesystem:~$ simple_wget.py
line: <td>2.6 up to 3.5 GHz – 6 MB cache – 4 cores – 8 threads</td>
fixed_line <td>2.6 up to 3.5 GHz - 6 MB cache - 4 cores - 8 threads</td>
Passing that output through od -cb to see the octal values ("342 200 223") of the dash characters corresponding to the r'\xe2\x80\x93' in the script above:
theuser@thesystem:~$ simple_wget.py | od -cb
0000000 l i n e : \t \t \t \t \t
154 151 156 145 072 040 040 040 040 040 040 011 011 011 011 011
0000020 \t < t d > 2 . 6 u p t o 3
011 074 164 144 076 062 056 066 040 165 160 040 164 157 040 063
0000040 . 5 G H z 342 200 223 6 M B
056 065 040 107 110 172 040 342 200 223 040 066 040 115 102 040
0000060 c a c h e 342 200 223 4 c o r e
143 141 143 150 145 040 342 200 223 040 064 040 143 157 162 145
0000100 s 342 200 223 8 t h r e a d s <
163 040 342 200 223 040 070 040 164 150 162 145 141 144 163 074
0000120 / t d > \n f i x e d _ l i n e
057 164 144 076 012 146 151 170 145 144 137 154 151 156 145 040
0000140 \t \t \t \t \t \t < t d > 2 . 6 u p
011 011 011 011 011 011 074 164 144 076 062 056 066 040 165 160
0000160 t o 3 . 5 G H z - 6
040 164 157 040 063 056 065 040 107 110 172 040 055 040 066 040
0000200 M B c a c h e - 4 c o r
115 102 040 143 141 143 150 145 040 055 040 064 040 143 157 162
0000220 e s - 8 t h r e a d s < /
145 163 040 055 040 070 040 164 150 162 145 141 144 163 074 057
0000240 t d > \n
164 144 076 012
0000244
theuser@thesystem:~$
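The same octal values can be confirmed from within Python, without od (a small sketch; bytearray iterates as integers under both Python 2 and 3):

```python
dash = b'\xe2\x80\x93'
# The bytes od -cb reports as octal 342 200 223:
assert list(bytearray(dash)) == [0o342, 0o200, 0o223]
```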
Things I've tried:
https://stackoverflow.com/a/19645137/257924 implies using an encoding of 'ascii', but that fails with a UnicodeEncodeError. Changing the script to:
#encoding = 'ISO-8859-1'
encoding = 'ascii' # try https://stackoverflow.com/a/19645137/257924
r.encoding = encoding
data = r.text.encode(encoding)
yields:
theuser@thesystem:~$ ./simple_wget.py
Traceback (most recent call last):
File "./simple_wget.py", line 18, in <module>
data = r.text.encode(encoding)
UnicodeEncodeError: 'ascii' codec can't encode characters in position 10166-10168: ordinal not in range(128)
Changing the last line above to be
data = r.text.encode(encoding, "ignore")
results in the dashes simply being removed, not translated, which is not what I want.
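To illustrate, a standalone sketch using a literal string in place of the page contents; neither error handler produces the ASCII dash I'm after:

```python
s = u'3.5 GHz \u2013 6 MB cache'  # u'\u2013' is the en dash

# 'ignore' silently drops the character:
assert s.encode('ascii', 'ignore') == b'3.5 GHz  6 MB cache'

# 'replace' substitutes '?', which at least leaves a marker:
assert s.encode('ascii', 'replace') == b'3.5 GHz ? 6 MB cache'
```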
And this also does not work at all:
encoding = 'ISO-8859-1'
r.encoding = encoding
data = r.text.encode(encoding)
charmap = {
    0x2014: u'-', # em dash
    0x201D: u'"', # right double quotation mark
    # etc.
}
data = data.translate(charmap)
because it gives this error:
Traceback (most recent call last):
File "./simple_wget.py", line 30, in <module>
data = tmp2.translate(charmap)
TypeError: expected a string or other character buffer object
which, as far as I can understand from https://stackoverflow.com/a/10385520/257924, is due to "data" not being a unicode string. A 256-character translation table would not do what I need anyhow. And besides, that is overkill: something inside Python should translate these multi-byte characters without requiring hack code at my script level.
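For what it's worth, the translate approach does work if applied to the unicode string (r.text) before encoding, since unicode.translate accepts a dict keyed by code point. A minimal standalone sketch (the character list is hand-picked, which is exactly what I'm trying to avoid):

```python
# Map code points to ASCII replacements; keys are ints, values unicode.
charmap = {
    0x2013: u'-',   # en dash (the character from the page above)
    0x2014: u'-',   # em dash
    0x201C: u'"',   # left double quotation mark
    0x201D: u'"',   # right double quotation mark
}

text = u'2.6 up to 3.5 GHz \u2013 6 MB cache'
fixed = text.translate(charmap)
assert fixed == u'2.6 up to 3.5 GHz - 6 MB cache'

# With the offenders mapped, the ASCII encode now succeeds:
data = fixed.encode('ascii')
```

But this only handles the characters I enumerate by hand; I'm hoping for something built in that handles the general case.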
By the way, I'm not interested in multi-lingual page translation. All pages translated are expected to be in US or British English.