
I'm trying to parse YouTube with BeautifulSoup, but without luck. I've parsed many websites without any problems, but this one doesn't work and gives me this error:

UnicodeEncodeError: 'charmap' codec can't encode character '\u2117' in position 135588: character maps to <undefined>

I decoded it as follows:

page_soup = soup(page_html.decode("utf-8"), "html.parser")


x = page_soup.find('div',{'id':"dismissable"})

I still get the error above. But when I try this:

page_soup = soup(page_html, "html.parser").encode("utf-8")

With the encoding I'm able to print out my webpage, but when I search in it as follows:

search_list = page_soup.find_all('div',{'class':"style-scope ytd-video-renderer"})

print(len(search_list))

I get the following error:

TypeError: slice indices must be integers or None or have an __index__ method

Any advice would be welcome.

Many thanks.

Additionally, my code:

import urllib3
from bs4 import BeautifulSoup as soup
from urllib.request import urlopen

import requests

http = urllib3.PoolManager()
set_Link = set([''])

url = 'https://www.youtube.com/results?search_query=the+lumineers+sleep+on+the+floor'

r = http.request('get',url)

page_html = r.data  # store the HTML data in a variable

page_soup = soup(page_html, "html.parser").encode("utf-8")


print(page_soup)

search_list = page_soup.find_all('div',{'class':"style-scope ytd-video-renderer"})

print(len(search_list))
ron_g
Jeroen F

2 Answers


Your code calls encode() on the parsed soup, which turns it into a bytes object, hence the exception. Decode the raw response before parsing instead:

page_soup = soup(page_html.decode("utf-8"), "html.parser") 
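
To see why the second attempt in the question ends in a TypeError, here is a minimal sketch using only the standard library. It assumes that a short byte string can stand in for the result of `soup(page_html, "html.parser").encode("utf-8")`, which is a plain `bytes` object rather than a searchable BeautifulSoup tree. The name `encoded_page` and its contents are made up for illustration.

```python
# Stand-in for soup(page_html, "html.parser").encode("utf-8"):
# encode() returns plain bytes, not a BeautifulSoup object.
encoded_page = b'<div id="dismissable">\xe2\x84\x97 demo</div>'

# bytes has its own find(sub, start, end) for substring search, so the
# attribute dict lands in the `start` slot and is rejected.
try:
    encoded_page.find(b"div", {"id": "dismissable"})
    error = None
except TypeError as exc:
    error = exc

print(type(error).__name__, error)

# bytes has no find_all() method at all.
print(hasattr(encoded_page, "find_all"))

# Decoding gives text back; U+2117 is the character from the traceback.
text = encoded_page.decode("utf-8")
print("\u2117" in text)
```

On CPython the caught message is typically the "slice indices must be integers" text from the question, which suggests the search was running on bytes and never reached BeautifulSoup.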
Apalala
  • Thank you for the answer. I did exactly what you said but still got the same error: UnicodeEncodeError: 'charmap' codec can't encode character '\u2117' in position 149996: character maps to <undefined> – Jeroen F Oct 28 '18 at 13:24
  • That means that the page is not in `utf-8` (probably some legacy MS Windows encoding?). Look at this [SO Q&A](https://stackoverflow.com/questions/436220) about detecting encoding. – Apalala Oct 29 '18 at 19:25
  • I dug a bit deeper into it, but it seems the data I want to collect isn't available, because there's JavaScript behind it =/ I'll have to use the YouTube API – Jeroen F Nov 05 '18 at 08:08

Just some advice for the first half of your question - you should use the 'unicode sandwich' approach and save yourself a lot of frustration:

  1. Make your input unicode (BeautifulSoup does this for you)
  2. Process in unicode
    • if you want to print() something, use print(repr(string))
  3. Encode your output as required
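
The three steps above can be sketched roughly as follows. This is only an illustration with the standard library: `raw_bytes` is a hypothetical stand-in for data arriving from the network, and the sample string is invented.

```python
# Hypothetical raw input: UTF-8 bytes as they might arrive from the network.
raw_bytes = "Sleep on the Floor \u2117 demo".encode("utf-8")

# 1. Decode once at the input boundary.
text = raw_bytes.decode("utf-8")

# 2. Process as str; no encode()/decode() calls in the middle.
title = text.split("\u2117")[0].strip()

# 3. Encode once at the output boundary, e.g. for a UTF-8 file or socket.
output_bytes = title.encode("utf-8")

print(ascii(text))   # ASCII-only view of the decoded text
print(output_bytes)
```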

Your first problem, the UnicodeEncodeError - was that the result of using a print statement on a string? If so, print like this:

print(repr(string))

to avoid encoding issues and keep your data in unicode until the end.

I.e. don't do this: page_soup = soup(page_html, "html.parser").encode("utf-8") just to print out the result.
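
If the console still cannot show a character, one possible workaround is to escape it only at print time. The sketch below is an assumption-laden example, not part of BeautifulSoup: `to_console_safe` is a hypothetical helper, and `cp1252` is assumed as a typical Windows console code page. Note that in Python 3 `repr()` keeps printable non-ASCII characters as they are, so the sketch uses `ascii()` and the `backslashreplace` error handler to get escapes.

```python
import sys


def to_console_safe(text, encoding=None):
    """Return text with characters the target encoding lacks shown as escapes."""
    # Fall back to UTF-8 when stdout reports no encoding.
    encoding = encoding or getattr(sys.stdout, "encoding", None) or "utf-8"
    # backslashreplace turns unencodable characters into \uXXXX escapes
    # instead of raising UnicodeEncodeError.
    return text.encode(encoding, errors="backslashreplace").decode(encoding)


sample = "\u2117 The Lumineers"

# cp1252 has no U+2117, so a strict encode fails like the print() did.
try:
    sample.encode("cp1252")
    strict_failed = False
except UnicodeEncodeError:
    strict_failed = True

print(strict_failed)
print(to_console_safe(sample, "cp1252"))
print(ascii(sample))
```

The data itself stays as `str` throughout; only the copy sent to the console is escaped.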

ron_g