BeautifulSoup parser and Cyrillic characters

Question

Hi guys!

I'm trying to parse this URL http://mapia.ua/ru/search?&city=%D0%9D%D0%B8%D0%BA%D0%BE%D0%BB%D0%B0%D0%B5%D0%B2&page=1&what=%D0%BE%D0%BE%D0%BE using BeautifulSoup.

But I get strange characters like this: �� 1 �� "��"

Here is my code:

from bs4 import BeautifulSoup
import urllib.request

URL = urllib.request.urlopen('http://mapia.ua/ru/search?city=%D0%9D%D0%B8%D0%BA%D0%BE%D0%BB%D0%B0%D0%B5%D0%B2&what=%D0%BE%D0%BE%D0%BE&page=1').read()

soup = BeautifulSoup(URL, 'html.parser')

print(soup.h3.get_text())

Can anybody help me?

P.S. I'm using Python 3.

The issue is with the shell you are using to output the data. I get `ЖЭК №1 ООО "Дуэт"` because my default encoding is UTF-8; your accepted answer actually causes it to not work. – Padraic Cunningham, Jun 02 '16 at 19:33
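To illustrate the point in the comment above, here is a minimal sketch of how the same string can look different depending on the output encoding. The helper name `render_for` is made up for this example, and the ASCII case is only an assumption about what a misconfigured console might do; the asker's actual shell encoding is not known from the thread.

```python
import sys


def render_for(text, encoding):
    """Round-trip `text` through `encoding`, replacing anything
    that encoding cannot represent (hypothetical helper)."""
    return text.encode(encoding, errors="replace").decode(encoding)


# The heading text reported in the comment above.
heading = 'ЖЭК №1 ООО "Дуэт"'

# On Python 3.7+ a real console stream can be switched to UTF-8.
# The hasattr check skips streams that do not support this.
if hasattr(sys.stdout, "reconfigure"):
    sys.stdout.reconfigure(encoding="utf-8", errors="replace")

print("stdout encoding:", getattr(sys.stdout, "encoding", None))
print(render_for(heading, "utf-8"))  # survives unchanged
print(render_for(heading, "ascii"))  # every Cyrillic character becomes '?'
```

If `sys.stdout.encoding` is not UTF-8 on your machine, setting the `PYTHONIOENCODING=utf-8` environment variable before starting Python is another way to get the same effect without changing the script.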

score -1 · Accepted Answer · edited May 23 '17 at 12:24


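I found this:

from bs4 import BeautifulSoup
import urllib.request

with urllib.request.urlopen('http://python.org/') as response:
    html = response.read()
# Drop any bytes that are not valid UTF-8, then pass UTF-8 bytes to the parser
soup = BeautifulSoup(html.decode('utf-8', 'ignore').encode("utf-8"), 'html.parser')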

From:

How to correctly parse UTF-8 encoded HTML to Unicode strings with BeautifulSoup?

Also:

Delete every non utf-8 symbols from string

Hope it helps ;)


answered Jun 02 '16 at 12:11

Destrif


Sorry! It doesn't help :( – andrii1986 Jun 02 '16 at 14:07
Sorry, forget the .encode("utf-8") at the end; the decode step will remove all non-UTF-8 characters. If you want to remove a different set of characters, you will have to do it with a regex. – Destrif Jun 02 '16 at 15:01
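A rough sketch of the two approaches mentioned in the last comment: decoding with `errors="ignore"` to drop invalid bytes, and a regex to strip characters outside a chosen set. The function names and the whitelist (printable ASCII, the Cyrillic block, and the № sign) are assumptions picked for this example, not something taken from the thread.

```python
import re


def drop_invalid_utf8(raw):
    """Decode bytes as UTF-8, silently discarding bytes that are not valid."""
    return raw.decode("utf-8", errors="ignore")


# Hypothetical whitelist: printable ASCII, Cyrillic (U+0400-U+04FF), and U+2116 (№).
NOT_ALLOWED = re.compile(r"[^\x20-\x7E\u0400-\u04FF\u2116]")


def keep_allowed(text):
    """Remove every character that is not in the whitelist above."""
    return NOT_ALLOWED.sub("", text)


# b"\xff" is never valid in UTF-8, so it is dropped during decoding.
print(repr(drop_invalid_utf8(b"\xd0\x96\xff1")))
# The smiley is outside the whitelist, so the regex removes it.
print(repr(keep_allowed("ЖЭК №1 ☺")))
```

Note that neither step fixes text that was decoded with the wrong encoding in the first place; they only discard data.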
