BeautifulSoup web scraping problems

Question

I'm trying to parse YouTube with BeautifulSoup, but without luck. I've parsed many websites and they all went perfectly, but this one doesn't work and gives me this error:

UnicodeEncodeError: 'charmap' codec can't encode character '\u2117' in position 135588: character maps to <undefined>

I decoded it as follows:

page_soup = soup(page_html.decode("utf-8"), "html.parser")


x = page_soup.find('div',{'id':"dismissable"})

I still get the error above. But when I try this:

Code:

page_soup = soup(page_html, "html.parser").encode("utf-8")

With the encoding applied I'm able to print out my webpage, but when I search in it as follows:

search_list = page_soup.find_all('div',{'class':"style-scope ytd-video-renderer"})

print(len(search_list))

I get the following error:

TypeError: slice indices must be integers or None or have an __index__ method

Any advice would be welcome.

Many thanks.

Additionally, here is my full code:

import urllib3
from bs4 import BeautifulSoup as soup
from urllib.request import urlopen

import requests

http = urllib3.PoolManager()
set_Link = set([''])

url = 'https://www.youtube.com/results?search_query=the+lumineers+sleep+on+the+floor'

r = http.request('get',url)

page_html = r.data  # store the HTML data in a variable

page_soup = soup(page_html, "html.parser").encode("utf-8")


print(page_soup)

search_list = page_soup.find_all('div',{'class':"style-scope ytd-video-renderer"})

print(len(search_list))

score 2 · Answer 1 · answered Oct 28 '18 at 04:00

Your code calls encode() on the parsed soup, which is the wrong place for it, hence the exception. Decode the raw response before parsing instead:

page_soup = soup(page_html.decode("utf-8"), "html.parser")
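To make the difference concrete, here is a minimal sketch. It assumes a small hand-written HTML snippet as a stand-in for `r.data` (the real YouTube response is not reproduced here) and imports `BeautifulSoup` under its own name rather than the `soup` alias. It shows that decoding before parsing keeps a searchable object, while calling `.encode()` on the soup returns plain bytes.

```python
from bs4 import BeautifulSoup

# Stand-in for r.data: raw UTF-8 bytes containing the character from the traceback.
page_html = '<div id="dismissable">\u2117 2016 Example Records</div>'.encode("utf-8")

# Decode first, then parse: the result is a BeautifulSoup object that supports find().
page_soup = BeautifulSoup(page_html.decode("utf-8"), "html.parser")
match = page_soup.find("div", {"id": "dismissable"})

# Calling .encode() on the soup serialises it to bytes, and bytes has no find_all().
as_bytes = page_soup.encode("utf-8")
has_find_all = hasattr(as_bytes, "find_all")
```

After `.encode()` the variable no longer holds a soup, so any BeautifulSoup-style search on it fails.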

Apalala

Thank you for the answer. I did exactly what you said but still got the same error: UnicodeEncodeError: 'charmap' codec can't encode character '\u2117' in position 149996: character maps to <undefined> – Jeroen F Oct 28 '18 at 13:24
That means that the page is not in `utf-8` (probably some legacy MS Windows encoding?). Look at this [SO Q&A](https://stackoverflow.com/questions/436220) about detecting encoding. – Apalala Oct 29 '18 at 19:25
I dug a bit deeper into it, but it seems that the data I want to collect isn't available, because there's JavaScript behind it =/ I'll have to use the YouTube API – Jeroen F Nov 05 '18 at 08:08
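One way to check the point made in the last comment is to count the target elements in the HTML exactly as the server delivers it. The sketch below is only an illustration: `count_static_matches` is a hypothetical helper, and both HTML strings are made-up stand-ins rather than real YouTube responses.

```python
from bs4 import BeautifulSoup


def count_static_matches(html: str, tag: str, css_class: str) -> int:
    """Count elements with the given tag and class in the HTML as delivered."""
    soup = BeautifulSoup(html, "html.parser")
    return len(soup.find_all(tag, class_=css_class))


# Hypothetical page where the element is already present in the markup.
static_html = '<div class="style-scope ytd-video-renderer">video</div>'

# Hypothetical page where a script would build the element later in the browser.
script_only_html = '<script>var data = {"videos": []};</script><div id="content"></div>'

found_static = count_static_matches(static_html, "div", "ytd-video-renderer")
found_script_only = count_static_matches(script_only_html, "div", "ytd-video-renderer")
```

A count of zero on the raw response suggests the elements are created client-side, in which case no amount of decoding or encoding will make them appear in the parsed soup.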

score 0 · Answer 2 · answered Oct 31 '18 at 16:13

Just some advice for the first half of your question - you should use the 'unicode sandwich' approach and save yourself a lot of frustration:

1. Make your input unicode (BeautifulSoup does this for you)
2. Process in unicode
   - if you want to print() something, use print(repr(string))
3. Encode your output as required

Your first problem, the UnicodeEncodeError - was that the result of using a print statement on a string? If so, print like this:

print(repr(string))

to avoid encoding issues and keep your data in unicode until the end.

I.e. don't do this: page_soup = soup(page_html, "html.parser").encode("utf-8") just to print out the result.
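The three steps can be sketched with the standard library alone. The sample string is an assumption chosen to contain the '\u2117' character from the traceback, and cp1252 stands in for a legacy Windows code page. Note that in Python 3, repr() leaves printable non-ASCII characters as they are, so this sketch uses the built-in ascii() to get a string that any console can print.

```python
# Input boundary: bytes arrive from the network (stand-in for r.data).
raw = "Sleep on the Floor \u2117 2016".encode("utf-8")

# 1. Decode once, as early as possible.
text = raw.decode("utf-8")

# 2. Process as str.
title = text.strip().upper()

# 3. Encode only at the output boundary, with an encoding that can hold the data.
utf8_out = title.encode("utf-8")

# A legacy code page such as cp1252 has no mapping for U+2117, which is the
# situation the 'charmap' error in the question describes.
try:
    title.encode("cp1252")
    cp1252_ok = True
except UnicodeEncodeError:
    cp1252_ok = False

# ascii() escapes every non-ASCII character, so the result prints on any console.
console_safe = ascii(title)
print(console_safe)
```

Writing to a file opened with `encoding="utf-8"` is another way to handle the output step without touching the console encoding.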

Thank you sir! I have to admit, I had no knowledge of the "unicode sandwich". Great advice! – Jeroen F Nov 05 '18 at 08:06
