-1

I tried this code in Python 3.8.2:

from bs4 import BeautifulSoup
import urllib.request

html = urllib.request.urlopen(
    'https://vietnamnet.vn/').read()

soup = BeautifulSoup(html, "html.parser").encode("utf-8")

print(soup.title)

But I received:

[screenshot of the output; image not preserved]

instead of the expected: <title>Báo VietNamNet - Tin tức online, tin nhanh Việt Nam và thế giới</title>

What am I doing wrong, and how can I fix it?

I have to use .encode("utf-8") because the HTML string contains Unicode characters. Does it affect the soup?

Thanks!

Quang Thái
  • `title` is a function, so you have to call the function: `print(soup.title())`, otherwise you get the function object itself. – MrBean Bremen Jun 30 '20 at 06:13

3 Answers

0

Calling .encode() on the BeautifulSoup object returns a byte string, so soup is no longer a parsed document. The parser result is lost, and soup.title then refers to the built-in bytes.title method rather than the <title> tag.
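A minimal sketch of that difference, assuming bs4 is installed. The inline HTML string is a hypothetical stand-in for the downloaded page, so no network access is needed:

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the downloaded page.
html = "<html><head><title>Tin tức online</title></head></html>"

soup = BeautifulSoup(html, "html.parser")
encoded = soup.encode("utf-8")  # serializes the parsed tree to UTF-8 bytes

print(type(soup).__name__)     # BeautifulSoup
print(type(encoded).__name__)  # bytes

# On the parsed tree, .title is the <title> tag.
title_text = soup.title.string

# On a bytes object, .title is the built-in bytes.title method,
# so printing it shows a method object instead of the tag.
print(callable(encoded.title))  # True
```

Keeping the BeautifulSoup object in soup and encoding only at the point where bytes are really needed (for example, when writing to a binary file) avoids the problem.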

bs4 should handle the character set for you, so pass the raw bytes straight to the parser:

soup = BeautifulSoup(html, "html.parser")
print(soup.title)

Output:

>>> from bs4 import BeautifulSoup
>>> import urllib.request
>>> html = urllib.request.urlopen(
...     'https://vietnamnet.vn/').read()

>>> soup = BeautifulSoup(html, "html.parser")
>>> print(soup.title)
<title>Báo VietNamNet - Tin tức online, tin nhanh Việt Nam và thế giới</title>
>>> 
Robert Kearns
  • Can you try my code? If I don't call .encode(), Python raises this error: "UnicodeEncodeError: 'charmap' codec can't encode character '\u1ee9' in position 29: character maps to <undefined>" – Quang Thái Jun 30 '20 at 05:18
  • Yes, the code works well for me. What version of Python are you using? I will edit my answer with my input/output. – Robert Kearns Jun 30 '20 at 05:20
  • I don't know what is going wrong. I'm using Windows 10, Python 3.8.2. – Quang Thái Jun 30 '20 at 05:23
0
from bs4 import BeautifulSoup
import urllib.request

html = urllib.request.urlopen(
    'https://vietnamnet.vn/').read().decode("utf-8")

soup = BeautifulSoup(html, "html.parser")

title = soup.title
print(title)
print(title.string)

You have to decode the response bytes right after reading them.
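As a hedged sketch of decoding right after reading: the helper below, decode_body, is a hypothetical name and not part of urllib or bs4. It assumes UTF-8 when no charset is known. With urllib, the declared charset can be read from the response via response.headers.get_content_charset(), which returns None when the header does not specify one.

```python
from typing import Optional


def decode_body(raw: bytes, charset: Optional[str] = None) -> str:
    """Decode response bytes, falling back to UTF-8 (an assumption)."""
    # errors="replace" keeps going on malformed bytes instead of raising.
    return raw.decode(charset or "utf-8", errors="replace")


# Possible usage with urllib (not executed here, needs network access):
#   with urllib.request.urlopen(url) as response:
#       html = decode_body(response.read(),
#                          response.headers.get_content_charset())

sample = "Tin tức online".encode("utf-8")
decoded = decode_body(sample)
print(len(decoded))  # 14
```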

Vishal Dhawan
  • I tried this but received this error: "UnicodeEncodeError: 'charmap' codec can't encode character '\u1ee9' in position 29: character maps to <undefined>" – Quang Thái Jun 30 '20 at 05:19
  • This is working for me (tried it on 3.8.2 also). Are you trying to write the data to a file? Take a look here https://stackoverflow.com/questions/27092833/unicodeencodeerror-charmap-codec-cant-encode-characters – Vishal Dhawan Jun 30 '20 at 05:34
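The reported error points at position 29, which is where "ứ" ('\u1ee9') sits in the expected title string. That suggests the failure happens in print, when the Windows console encoding cannot represent the character, rather than in the parser. The sketch below reproduces this with in-memory streams; cp1252 is an assumed console encoding, and try_write and use_utf8_stdout are hypothetical helper names.

```python
import io
import sys

title = "<title>Tin tức online</title>"  # hypothetical sample text


def try_write(encoding: str) -> str:
    """Write the title to an in-memory text stream using the given encoding."""
    stream = io.TextIOWrapper(io.BytesIO(), encoding=encoding)
    try:
        stream.write(title)
        stream.flush()
    except UnicodeEncodeError as exc:
        return "failed: " + exc.reason
    return "ok"


print("cp1252:", try_write("cp1252"))
print("utf-8:", try_write("utf-8"))


def use_utf8_stdout() -> None:
    """Switch the real stdout to UTF-8 when it supports reconfigure() (Python 3.7+)."""
    if isinstance(sys.stdout, io.TextIOWrapper):
        sys.stdout.reconfigure(encoding="utf-8")
```

If this is the cause, calling use_utf8_stdout() once before printing, or setting the PYTHONIOENCODING environment variable to utf-8 before starting Python, may let print(soup.title) work without calling .encode() on the soup.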
-1

After reading the response from the URL, decode it with the appropriate encoding, as shown below.

from bs4 import BeautifulSoup
import urllib.request

html = urllib.request.urlopen(
    'https://vietnamnet.vn/').read().decode('utf8')

soup = BeautifulSoup(html, "html.parser")
title = soup.find('title')

print("title is :", title)
  • I got this error while trying your code: "UnicodeEncodeError: 'charmap' codec can't encode character '\u1ee9' in position 29: character maps to <undefined>" – Quang Thái Jun 30 '20 at 05:24