
I watched a video that teaches how to use BeautifulSoup and requests to scrape a website. Here's the code:

from bs4 import BeautifulSoup as bs4
import requests
import pandas as pd

pages_to_scrape = 1

for i in range(1,pages_to_scrape+1):
    url = ('http://books.toscrape.com/catalogue/page-{}.html').format(i)
    pages.append(url)
for item in pages:
    page = requests.get(item)
    soup = bs4(page.text, 'html.parser')
    #print(soup.prettify())
for j in soup.findAll('p', class_='price_color'):
    price=j.getText()
    print(price)

The code is working well. However, in the results I noticed a weird character before the euro symbol, and when I checked the HTML source I didn't find that character. Any ideas why this character appears, and how can it be fixed? Is using replace enough, or is there a better approach?

YasserKhalil
  • Sounds like you don't understand character sets, and are looking at UTF-8 with some legacy character set enabled, like maybe Windows code page 1251. – tripleee Nov 26 '20 at 18:21
  • Possible duplicate of https://stackoverflow.com/questions/10611455/what-is-character-encoding-and-why-should-i-bother-with-it; see also https://meta.stackoverflow.com/questions/379403/problematic-questions-about-decoding-errors – tripleee Nov 26 '20 at 18:21

2 Answers


You could use page.content.decode('utf-8') instead of page.text. As people in the comments said, it is an encoding issue: .content returns the HTML as bytes, which you can then convert into a string with the right encoding using .decode('utf-8'), whereas .text returns a string decoded with the wrong encoding (maybe cp1252). The final code may look like this:

from bs4 import BeautifulSoup as bs4
import requests
import pandas as pd

pages_to_scrape = 1
pages = [] # You forgot this line

for i in range(1,pages_to_scrape+1):
    url = ('http://books.toscrape.com/catalogue/page-{}.html').format(i)
    pages.append(url)
for item in pages:
    page = requests.get(item)
    soup = bs4(page.content.decode('utf-8'), 'html.parser') # Replace .text with .content.decode('utf-8')
    #print(soup.prettify())
    # Keep the price loop inside the page loop so every page is printed, not only the last one
    for j in soup.findAll('p', class_='price_color'):
        price = j.getText()
        print(price)

This should hopefully work
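
To see why decoding with the wrong encoding produces an extra character, here is a minimal offline sketch. The price string "£51.77" is an assumed sample value chosen for illustration, and cp1252 is only one candidate for the wrong encoding, as guessed above.

```python
# A price string as it might appear in the page source (assumed sample value).
price = "£51.77"

# Sent over the wire as UTF-8, the currency sign becomes the two bytes C2 A3.
raw = price.encode("utf-8")

# A single-byte legacy code page maps each of those bytes to its own
# character, which is where the stray leading character comes from.
wrong = raw.decode("cp1252")

# Decoding with the encoding the bytes were written in restores the text.
right = raw.decode("utf-8")

print(wrong)  # Â£51.77
print(right)  # £51.77
```

Because the bytes themselves are fine, decoding them correctly is a cleaner fix than stripping the extra character with replace. Another option that should work with requests is setting page.encoding = 'utf-8' before reading page.text.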

P.S.: Sorry for writing this directly as an answer; I don't have enough reputation to write comments :D

XazkerBoy

It seems to me that you described the problem incorrectly. I assume that you are using Windows, where your terminal/IDLE uses the default encoding of cp1252.

But you are dealing with UTF-8, so you have to configure your terminal/IDLE to use UTF-8.

import requests
from bs4 import BeautifulSoup


def main(url):
    with requests.Session() as req:
        for item in range(1, 10):
            r = req.get(url.format(item))
            print(r.url)
            soup = BeautifulSoup(r.content, 'html.parser')
            goal = [(x.h3.a.text, x.select_one("p.price_color").text)
                    for x in soup.select("li.col-xs-6")]
            print(goal)


main("http://books.toscrape.com/catalogue/page-{}.html")
  1. Try to always follow the DRY principle, which means "Don't Repeat Yourself".
  2. Since you are dealing with the same host, you should keep one session open instead of opening a TCP socket, closing it, and opening it again for every request. Doing that can get your requests blocked and treated as a DDoS attack, because the back-end captures the TCP flags of every new connection. Imagine opening your browser, opening a website, closing it, and repeating that cycle over and over!
  3. Python functions usually look nice and are easy to read, instead of letting the code look like journal text.

Note the use of range(), the {} format string, and CSS selectors.
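
The point about terminal encoding at the start of this answer can be checked from Python itself. The sketch below is only an illustration: the helper names stdout_encoding and force_utf8_stdout are made up for this example, and reconfigure() is available on standard text streams from Python 3.7 but may be missing on streams that an IDE has replaced.

```python
import sys


def stdout_encoding() -> str:
    """Return the encoding Python uses when printing, or 'unknown'."""
    return getattr(sys.stdout, "encoding", None) or "unknown"


def force_utf8_stdout() -> bool:
    """Switch stdout to UTF-8 if the stream supports reconfigure()."""
    reconfigure = getattr(sys.stdout, "reconfigure", None)
    if reconfigure is None:
        # The stream was replaced by something without reconfigure().
        return False
    reconfigure(encoding="utf-8")
    return True


print("stdout encoding before:", stdout_encoding())
if force_utf8_stdout():
    print("stdout encoding after:", stdout_encoding())
```

If the first line printed already says utf-8, the terminal is probably not the cause and the decoding step in the scraper is the more likely place to look. Setting the PYTHONIOENCODING environment variable to utf-8 before starting Python is another way to change this without touching the script.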

  • How can I extract the number of stars, which is stored in an attribute? I tried to modify the code you posted like this: `(x.h3.a.text, x.select_one("p.price_color").text, x.select_one("p.star-rating").attrs.items())` but it didn't work. I know it is wrong, but how can I get the attribute value? – YasserKhalil Nov 26 '20 at 20:13
  • For the stars I get `dict_items([('class', ['star-rating', 'Three'])])` in the result. How can I get only `Three`? – YasserKhalil Nov 26 '20 at 20:22
  • @YasserKhalil `(x.h3.a.text, x.select_one("p.star-rating")['class'][-1], x.select_one("p.price_color").text)` – αԋɱҽԃ αмєяιcαη Nov 26 '20 at 21:01
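
The expression from the last comment can be tried without any network access. This is a minimal sketch: SAMPLE_HTML is a hand-written stand-in that only assumes the class names mentioned in this thread, and WORD_TO_STARS and extract_books are hypothetical helpers added for illustration.

```python
from bs4 import BeautifulSoup

# Hand-written stand-in for one product entry (assumed structure, not the real page).
SAMPLE_HTML = """
<ol>
  <li class="col-xs-6">
    <article>
      <p class="star-rating Three"></p>
      <h3><a href="#">A Light in the Attic</a></h3>
      <p class="price_color">£51.77</p>
    </article>
  </li>
</ol>
"""

# Hypothetical mapping from the rating word in the class list to a number.
WORD_TO_STARS = {"One": 1, "Two": 2, "Three": 3, "Four": 4, "Five": 5}


def extract_books(html):
    """Return (title, stars, price) tuples for every product entry."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for x in soup.select("li.col-xs-6"):
        # 'class' is a multi-valued attribute, so BeautifulSoup returns a list
        # such as ['star-rating', 'Three']; the last item is the rating word.
        rating_word = x.select_one("p.star-rating")["class"][-1]
        rows.append((
            x.h3.a.text,
            WORD_TO_STARS.get(rating_word),
            x.select_one("p.price_color").text,
        ))
    return rows


print(extract_books(SAMPLE_HTML))
```

Indexing the tag with ['class'] gives the attribute value directly, so there is no need to go through attrs.items().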