I am using Python 3 and BeautifulSoup to scrape a website, but I got the error shown below. I tried the solutions given in other answers, but none of them solved my problem.

# -*- coding: utf-8 -*-
import os
import locale
os.environ["PYTHONIOENCODING"] = "utf-8"
myLocale=locale.setlocale(category=locale.LC_ALL, locale="en_GB.UTF-8")

from urllib.request import urlopen
from bs4 import BeautifulSoup
import re
import pandas as pd


def getrank(animeurl):
    html = urlopen(animeurl)
    bslink = BeautifulSoup(html.read(), 'html.parser')

    rank = bslink.find('span', {'class' : 'numbers ranked'}).get_text().replace('Ranked #', '')
    return rank


def spring19():
    html = urlopen('https://...')
    bs = BeautifulSoup(html.read(), 'html.parser')
    
    link = []
    for x in bs.find_all('a', {'class' : 'link-title'}):
        link.append(x.get("href"))
    
    
    
    ranklist = []
    for x in link:
        x.encode(encoding='UTF-8',errors='ignore')
        ranklist.append(getrank(x))
    
    return ranklist

spring19()


The error message is: UnicodeEncodeError: 'ascii' codec can't encode character '\u2159' in position 32: ordinal not in range(128)

The error occurs because some of the URLs I scraped contain non-ASCII symbols, but I still have no idea how to fix it.

Thanks a lot!

frifin
  • Have you tried another type of encoding? Windows-1252, for example? You should be able to obtain the encoding used for the webpage from the HTML itself (in the head part, the charset meta element), or, possibly better, from the header provided by the server (which BeautifulSoup won't have any knowledge of; it gets lost once you've downloaded the document). – 9769953 Jun 28 '19 at 07:09
  • Please indicate *where* exactly the error occurs in your script. – 9769953 Jun 28 '19 at 07:10
  • You never assign the result of `x.encode(encoding='UTF-8',errors='ignore')` to anything. `x` remains whatever it was before, since the result of the encoding is thrown away. – 9769953 Jun 28 '19 at 07:12
  • Thanks so much for your help! The encoding used for the website is actually UTF-8. I figured out that the reason for this error is that there are some symbols such as ☆ in the URLs I scraped, but I am still working on fixing this problem. – frifin Jun 28 '19 at 08:53
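
To illustrate the comment above about the discarded result: a minimal sketch, assuming a made-up URL on example.com (not the site from the question). It shows that `str.encode` returns a new `bytes` object and leaves the original string untouched, and that encoding the same string as ASCII raises the error type seen in the traceback.

```python
# Hypothetical URL containing a non-ASCII symbol (not from the original post).
url = "https://example.com/anime/Star☆Show"

# str.encode returns a new bytes object; the original string is not modified.
encoded = url.encode(encoding="UTF-8", errors="ignore")
print(type(url).__name__)      # the original is still a str
print(type(encoded).__name__)  # the result is a separate bytes object

# Encoding the same string as ASCII fails on the non-ASCII character.
try:
    url.encode("ascii")
    ascii_failed = False
except UnicodeEncodeError as exc:
    ascii_failed = True
    print("ASCII encoding failed:", exc.reason)
```

So calling `x.encode(...)` on its own line inside the loop has no effect on the `x` that is later passed to `getrank`.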

1 Answer

Solved this problem with the solution from "How to convert a url string to safe characters with python?".

The code was modified as below, with `quote` imported from `urllib.parse`:

    ranklist = []
    for x in link:
        x = quote(x, safe='/:?=&')
        ranklist.append(getrank(x))
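
As a self-contained illustration of what `quote` does here, a minimal sketch follows. The helper name `make_url_safe` and the example.com URL are assumptions for demonstration only; the `safe='/:?=&'` argument is the one used above.

```python
from urllib.parse import quote


def make_url_safe(url: str) -> str:
    """Percent-encode unsafe characters while keeping common URL delimiters."""
    # The characters in `safe` are left as-is; non-ASCII characters are
    # UTF-8 encoded and then percent-escaped.
    return quote(url, safe="/:?=&")


# Hypothetical URL with a non-ASCII symbol (not from the original post).
raw = "https://example.com/anime/1/Star☆Show"
safe = make_url_safe(raw)
print(safe)  # https://example.com/anime/1/Star%E2%98%86Show

# The result is pure ASCII, so encoding it as ASCII no longer raises.
safe.encode("ascii")
```

One caveat: `%` is not in the `safe` set, so a URL that already contains percent escapes would be escaped a second time (`%20` becomes `%2520`). If the scraped links might already be partly encoded, adding `%` to `safe` is one possible workaround.
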
frifin