
I am trying to run a Python script that fetches some data from a web page.

import urllib.request, urllib.parse, urllib.error
from bs4 import BeautifulSoup


url = urllib.request.urlopen('http://database.ukrcensus.gov.ua/PXWEB2007/ukr/news/op_popul.asp')
soup = BeautifulSoup(url, 'html.parser')
print(soup)

It runs fine on a Mac, but when I run it on Linux I get output like this:

<area alt="������� ��������" coords="2,5,21,26" href="../../index.htm" shape="rect" title="������� ��������"/>
<area alt="����� �����" coords="43,6,62,26" href="../../ukr/help/web_map.asp" shape="rect" title="����� �����"/>
<area alt="��������� ��'����" coords="85,7,105,27" href="../../ukr/help/contact.asp" shape="rect" title="��������� ��'����"/>

I guess something is wrong with the encoding/decoding, but I cannot figure out what exactly. Thanks in advance.

Maxiboi
  • Seems like you're missing the correct charset. Have you tried converting it to ASCII? – Cobalt Oct 03 '20 at 10:35
  • How can I do that? `ord()` or `char()` did not work – Maxiboi Oct 03 '20 at 10:52
  • Running your script on Linux prints the characters correctly. What's your terminal? Is it set to show Unicode characters? – Andrej Kesely Oct 03 '20 at 13:21
  • The thing is, it's not just in the terminal. It also loads undecoded characters into an SQL database – Maxiboi Oct 03 '20 at 13:30
  • We can’t tell you the correct encoding without seeing (a representative, ideally small sample of) the actual contents of the data in an unambiguous representation; a hex dump of the problematic byte(s) with a few bytes of context on each side is often enough, especially if you can tell us what you think those bytes are supposed to represent. See also https://meta.stackoverflow.com/questions/379403/problematic-questions-about-decoding-errors – tripleee Oct 03 '20 at 13:53
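
A hex dump like the one requested in the last comment can be produced with a few lines of Python. This is only a minimal sketch: `hex_window` is a hypothetical helper, and the sample bytes are an illustrative stand-in for the real response body (which you would get from `urllib.request.urlopen(...).read()`).

```python
def hex_window(data: bytes, offset: int, context: int = 4) -> str:
    """Return the byte at `offset` plus `context` bytes on each side, as hex."""
    start = max(0, offset - context)
    end = min(len(data), offset + context + 1)
    # bytes.hex(sep) requires Python 3.8 or newer
    return data[start:end].hex(" ")


# Illustrative stand-in for the raw bytes of the page (assumed to be windows-1251)
sample = "Карта".encode("windows-1251")

# Show the byte at index 2 with one byte of context on each side
print(hex_window(sample, 2, context=1))
```

Posting output like this, together with the text you expect those bytes to represent, makes it much easier to identify the encoding.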

1 Answer


You can try explicitly decoding the response with a specific encoding. For example:

import urllib.request, urllib.parse, urllib.error
from bs4 import BeautifulSoup

response = urllib.request.urlopen('http://database.ukrcensus.gov.ua/PXWEB2007/ukr/news/op_popul.asp')
html = response.read().decode('windows-1251')  # <--- explicitly decode the response using 'windows-1251' encoding
soup = BeautifulSoup(html, 'html.parser')
print(soup)

Prints:

...

<map name="Map">
<area alt="Українською" coords="8,5,82,19" href="../../../../index.htm" shape="rect"/>
<area alt="По-русски" coords="8,22,83,37" href="../../../../index.htm#" shape="rect"/>
<area alt="In English" coords="8,39,83,56" href="../../../../index.htm#" shape="rect"/>
</map>
<map name="Map2">
<area alt="Головна сторінка" coords="2,5,21,26" href="../../index.htm" shape="rect" title="Головна сторінка"/>
<area alt="Карта сайту" coords="43,6,62,26" href="../../ukr/help/web_map.asp" shape="rect" title="Карта сайту"/>
<area alt="Зворотній зв'язок" coords="85,7,105,27" href="../../ukr/help/contact.asp" shape="rect" title="Зворотній зв'язок"/>
</map>
<map name="Map3">
<area coords="63,3,155,26" href="http://www.ukrcensus.gov.ua/index.php" shape="rect"/>
<area coords="65,25,162,44" href="http://www.ukrcensus.gov.ua/index.php" shape="rect"/>
</map>
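
If you would rather not hard-code the encoding everywhere, one option is to try the charset the server declares first and fall back to `windows-1251` only when that fails. The sketch below is an assumption-laden illustration, not a definitive fix: `decode_html` is a hypothetical helper, and the fallback is simply the encoding that worked for this page. It also shows how bytes in this encoding turn into replacement characters when they are decoded as UTF-8, which may be what happened in the question.

```python
from typing import Optional


def decode_html(raw: bytes, declared: Optional[str] = None,
                fallback: str = "windows-1251") -> str:
    """Decode raw bytes using the declared charset, else the fallback."""
    for encoding in (declared, fallback):
        if not encoding:
            continue
        try:
            return raw.decode(encoding)
        except (LookupError, UnicodeDecodeError):
            # Unknown codec name or bytes invalid for it: try the next one
            continue
    # Last resort: keep going, marking undecodable bytes with U+FFFD
    return raw.decode(fallback, errors="replace")


# Illustrative stand-in for response.read(); with a real response the declared
# charset could come from response.headers.get_content_charset()
raw = "Головна сторінка".encode("windows-1251")

# Decoding the same bytes as UTF-8 produces replacement characters
garbled = raw.decode("utf-8", errors="replace")
print("\ufffd" in garbled)

# A wrong declared charset falls through to the fallback
print(decode_html(raw, declared="utf-8"))
```

The resulting `str` can then be passed to `BeautifulSoup(html, 'html.parser')` exactly as in the code above.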
Andrej Kesely