I'm downloading and parsing a web page with a Python script. I need the result encoded into 7-bit ASCII for further processing. I am using the requests library (http://docs.python-requests.org/en/master/) in a virtualenv based on whatever Python Ubuntu 16.04 LTS provides.
I would like the requests package, or some package, to handle the translation into ASCII, without requiring me to do further translation of encoded characters, because I know I am going to miss some characters. Details are as follows:
My current Python script, shown below, uses an encoding of ISO-8859-1 in an attempt to force the result data to be converted to 7-bit ASCII, with partial success. But I have to both set the result encoding and encode the text as it comes out. That seems odd, and in fact downright wrong. But even if I live with it, I have the following main issue:
Even after the encoding, I see dashes encoded in what seems to be some non-ASCII character set. It is as if the dash characters slipped through the requests encoding. The script below hacks around this by searching for the multi-byte dash encoding and replacing it with an ASCII dash character. That is not a big deal for one multi-byte character, but I suspect there are other characters that will need translating in other web pages I wish to process. Do I simply need to use some encoding other than 'ISO-8859-1' with the requests object?
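For reference, those three bytes appear to be the UTF-8 encoding of U+2013 (EN DASH), which has no 7-bit ASCII equivalent. My understanding of why they slip through: decoding them as ISO-8859-1 yields three one-byte characters, and re-encoding as ISO-8859-1 reproduces the original bytes unchanged. A minimal standalone sketch of what I believe is happening (no network access needed):

```python
raw = b'\xe2\x80\x93'  # the bytes seen on the wire

# Decoded as UTF-8, they form a single en dash character:
assert raw.decode('utf-8') == u'\u2013'

# Decoded as ISO-8859-1, they become three Latin-1 characters, and
# re-encoding as ISO-8859-1 reproduces the original bytes exactly,
# so the multi-byte sequence survives the round trip untouched:
text = raw.decode('ISO-8859-1')
assert len(text) == 3
assert text.encode('ISO-8859-1') == raw
```

If that analysis is right, picking a different single-byte encoding won't help; the dash has to be translated at the Unicode level or dropped.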
Here is my script (using Python 2.7.11 on Ubuntu 16.04 LTS on x86_64):
#!/usr/bin/env python
import sys
import os
import string
import re
import requests
url = "https://system76.com/laptops/kudu"
r = requests.get(url)
#
# Why do I have to BOTH set r.encoding AND call r.text.encode
# in order to avoid the errors?:
#
encoding = 'ISO-8859-1'
r.encoding = encoding
data = r.text.encode(encoding)
#
# Split the lines out, find the offending line,
# and translate the multi-byte characters:
#
lines = data.splitlines()
for line in lines:
    m = re.search(r'2\.6 up to 3\.5 GHz', line)
    if m:
        print "line: {}".format(line)
        # The '-' in the next line is an ASCII dash character:
        fixed_line = re.sub(r'\xe2\x80\x93', '-', line)
        print "fixed_line {}".format(fixed_line)
Invoking simple_wget.py within the virtualenv shows:
theuser@thesystem:~$ simple_wget.py
line: <td>2.6 up to 3.5 GHz – 6 MB cache – 4 cores – 8 threads</td>
fixed_line <td>2.6 up to 3.5 GHz - 6 MB cache - 4 cores - 8 threads</td>
Passing that output through od -cb to see the octal values ("342 200 223") of the dash characters corresponding to the r'\xe2\x80\x93' in the script above:
theuser@thesystem:~$ simple_wget.py | od -cb
0000000 l i n e : \t \t \t \t \t
154 151 156 145 072 040 040 040 040 040 040 011 011 011 011 011
0000020 \t < t d > 2 . 6 u p t o 3
011 074 164 144 076 062 056 066 040 165 160 040 164 157 040 063
0000040 . 5 G H z 342 200 223 6 M B
056 065 040 107 110 172 040 342 200 223 040 066 040 115 102 040
0000060 c a c h e 342 200 223 4 c o r e
143 141 143 150 145 040 342 200 223 040 064 040 143 157 162 145
0000100 s 342 200 223 8 t h r e a d s <
163 040 342 200 223 040 070 040 164 150 162 145 141 144 163 074
0000120 / t d > \n f i x e d _ l i n e
057 164 144 076 012 146 151 170 145 144 137 154 151 156 145 040
0000140 \t \t \t \t \t \t < t d > 2 . 6 u p
011 011 011 011 011 011 074 164 144 076 062 056 066 040 165 160
0000160 t o 3 . 5 G H z - 6
040 164 157 040 063 056 065 040 107 110 172 040 055 040 066 040
0000200 M B c a c h e - 4 c o r
115 102 040 143 141 143 150 145 040 055 040 064 040 143 157 162
0000220 e s - 8 t h r e a d s < /
145 163 040 055 040 070 040 164 150 162 145 141 144 163 074 057
0000240 t d > \n
164 144 076 012
0000244
theuser@thesystem:~$
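The same octal values can be confirmed from within Python, without od (a small sketch; bytearray iterates as integers under both Python 2 and 3):

```python
dash = b'\xe2\x80\x93'
# The bytes od -cb reports as octal 342 200 223:
assert list(bytearray(dash)) == [0o342, 0o200, 0o223]
```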
Things I've tried:
https://stackoverflow.com/a/19645137/257924 implies using an encoding of 'ascii', but that fails with a UnicodeEncodeError. Changing the script to:
#encoding = 'ISO-8859-1'
encoding = 'ascii' # try https://stackoverflow.com/a/19645137/257924
r.encoding = encoding
data = r.text.encode(encoding)
yields:
theuser@thesystem:~$ ./simple_wget.py
Traceback (most recent call last):
File "./simple_wget.py", line 18, in <module>
data = r.text.encode(encoding)
UnicodeEncodeError: 'ascii' codec can't encode characters in position 10166-10168: ordinal not in range(128)
Changing the last line above to be
data = r.text.encode(encoding, "ignore")
results in the dashes simply being removed, not translated, which is not what I want.
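To illustrate, a standalone sketch using a literal string in place of the page contents; neither error handler produces the ASCII dash I'm after:

```python
s = u'3.5 GHz \u2013 6 MB cache'  # u'\u2013' is the en dash

# 'ignore' silently drops the character:
assert s.encode('ascii', 'ignore') == b'3.5 GHz  6 MB cache'

# 'replace' substitutes '?', which at least leaves a marker:
assert s.encode('ascii', 'replace') == b'3.5 GHz ? 6 MB cache'
```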
And this also does not work at all:
encoding = 'ISO-8859-1'
r.encoding = encoding
data = r.text.encode(encoding)
charmap = {
    0x2014: u'-', # em dash
    0x201D: u'"', # right double quotation mark
    # etc.
}
data = data.translate(charmap)
because it gives this error:
Traceback (most recent call last):
File "./simple_wget.py", line 30, in <module>
data = tmp2.translate(charmap)
TypeError: expected a string or other character buffer object
which, as far as I can understand from https://stackoverflow.com/a/10385520/257924, is due to "data" not being a unicode string. A 256-character translation table would not do what I need anyhow. And besides, that is overkill: something inside Python should translate these multi-byte characters without requiring hack code at my script level.
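For what it's worth, the translate approach does work if applied to the unicode string (r.text) before encoding, since unicode.translate accepts a dict keyed by code point. A minimal standalone sketch (the character list is hand-picked, which is exactly what I'm trying to avoid):

```python
# Map code points to ASCII replacements; keys are ints, values unicode.
charmap = {
    0x2013: u'-',   # en dash (the character from the page above)
    0x2014: u'-',   # em dash
    0x201C: u'"',   # left double quotation mark
    0x201D: u'"',   # right double quotation mark
}

text = u'2.6 up to 3.5 GHz \u2013 6 MB cache'
fixed = text.translate(charmap)
assert fixed == u'2.6 up to 3.5 GHz - 6 MB cache'

# With the offenders mapped, the ASCII encode now succeeds:
data = fixed.encode('ascii')
```

But this only handles the characters I enumerate by hand; I'm hoping for something built in that handles the general case.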
By the way, I'm not interested in multi-lingual page translation. All pages translated are expected to be in US or British English.