
I am trying to parse a webpage and print the link (href) for each item. Can you help me find where I am going wrong?

import requests
from bs4 import BeautifulSoup

link = "https://www.amazon.in/Power-Banks/b/ref=nav_shopall_sbc_mobcomp_powerbank?ie=UTF8&node=6612025031"

def amazon(url):
    sourcecode = requests.get(url)
    sourcecode_text = sourcecode.text
    soup = BeautifulSoup(sourcecode_text)

    for link in soup.findALL('a', {'class': 'a-link-normal aok-block a-text-normal'}):
        href = link.get('href')
        print(href)

amazon(link)

Output:

C:\Users\TIMAH\AppData\Local\Programs\Python\Python37\python.exe "C:/Users/TIMAH/OneDrive/study materials/Python_Test_Scripts/Self Basic/Class_Test.py"
Traceback (most recent call last):
  File "C:/Users/TIMAH/OneDrive/study materials/Python_Test_Scripts/Self Basic/Class_Test.py", line 15, in <module>
    amazon(link)
  File "C:/Users/TIMAH/OneDrive/study materials/Python_Test_Scripts/Self Basic/Class_Test.py", line 9, in amazon
    soup = BeautifulSoup(sourcecode_text, 'features="html.parser"')
  File "C:\Users\TIMAH\AppData\Local\Programs\Python\Python37\lib\site-packages\bs4\__init__.py", line 196, in __init__
    % ",".join(features))
bs4.FeatureNotFound: Couldn't find a tree builder with the features you requested: features="html.parser". Do you need to install a parser library?

Process finished with exit code 1

Alan Kavanagh
amit kumar
  • What's a `BeautifulSoap` ? – Alan Kavanagh Feb 14 '19 at 16:40
  • It's a package for parsing HTML data. – amit kumar Feb 14 '19 at 16:43
  • See this question. https://stackoverflow.com/q/24398302/494134 – John Gordon Feb 14 '19 at 16:46
  • You're not going to do much with what you have as the site blocks bots. If you read what you actually parse it says, "To discuss automated access to Amazon data please contact api-services-support@amazon.com. For information about migrating to our APIs refer to our Marketplace APIs at https://developer.amazonservices.in/ref=rm_c_sv, or our Product Advertising API at https://affiliate-program.amazon.in/gp/advertising/api/detail/main.html/ref=rm_c_ac for advertising use cases." – chitown88 Feb 14 '19 at 16:57
  • This is just for learning purposes. Do you know of any other site that I can parse simply? – amit kumar Feb 14 '19 at 17:04

3 Answers

1

You can, though, add headers to the request. Also, when you call find_all('a'), you can pass href=True so that only the anchors that actually have an href are returned:

import requests
from bs4 import BeautifulSoup

link = "https://www.amazon.in/Power-Banks/b/ref=nav_shopall_sbc_mobcomp_powerbank?ie=UTF8&node=6612025031"

def amazon(url):
    headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36'}

    sourcecode = requests.get(url, headers=headers)
    sourcecode_text = sourcecode.text
    soup = BeautifulSoup(sourcecode_text, 'html.parser')

    for link in soup.find_all('a', href=True):
        href = link.get('href')
        print(href)

amazon(link)
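
The effect of href=True can be checked offline, without touching Amazon at all. The following is a minimal sketch: SAMPLE_HTML and extract_hrefs are hypothetical names made up for illustration, and the markup is invented rather than taken from the real page.

```python
from bs4 import BeautifulSoup

# Hypothetical markup standing in for a downloaded page (not real Amazon HTML).
SAMPLE_HTML = """
<div>
  <a class="a-link-normal" href="/item/1">Item 1</a>
  <a name="anchor-without-href">No link here</a>
  <a href="/item/2">Item 2</a>
</div>
"""


def extract_hrefs(html):
    """Return the href of every <a> tag that has one."""
    soup = BeautifulSoup(html, "html.parser")
    # href=True keeps only the tags on which the href attribute is present,
    # so the second anchor above is skipped.
    return [a["href"] for a in soup.find_all("a", href=True)]


if __name__ == "__main__":
    print(extract_hrefs(SAMPLE_HTML))  # ['/item/1', '/item/2']
```

Without href=True, link.get('href') would return None for the anchor that has no href, and that None would be printed along with the real links.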
chitown88
0

The problem is that your code uses the wrong method name, findALL. A soup object has no findALL method, so the attribute lookup is treated as a search for a tag named findALL and None is returned; calling that None then fails with TypeError: 'NoneType' object is not callable. To fix that, use find_all in new code; findAll (with a lowercase double l) also works. Hope this clears things up.

import requests
from bs4 import BeautifulSoup

link = "https://www.amazon.in/Power-Banks/b/ref=nav_shopall_sbc_mobcomp_powerbank?ie=UTF8&node=6612025031"


def amazon(url):
    sourcecode = requests.get(url)
    sourcecode_text = sourcecode.text
    soup = BeautifulSoup(sourcecode_text, "html.parser")
    # Pass "html.parser" as the second argument so you do not get a warning.
    # Use soup.find_all in new code; soup.findAll also works.
    for link in soup.find_all('a', {'class': 'a-link-normal aok-block a-text-normal'}):
        href = link.get('href')
        print(href)

amazon(link)
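
Both failures from the question can be reproduced offline on a tiny document. This is only a sketch: SAMPLE_HTML is invented markup, and the variable names are hypothetical. It also covers the traceback in the question, which shows the literal string 'features="html.parser"' being passed as the parser name, although the posted code does not contain that call.

```python
from bs4 import BeautifulSoup, FeatureNotFound

# Invented one-line document that reuses the class string from the question.
SAMPLE_HTML = '<a class="a-link-normal aok-block a-text-normal" href="/p/1">Power bank</a>'

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# Unknown attribute names are looked up as tag names, so this is a search
# for a <findALL> tag. No such tag exists, so the result is None.
missing_method = soup.findALL

# The correctly spelled method returns the matching tags.
hrefs = [
    a.get("href")
    for a in soup.find_all("a", {"class": "a-link-normal aok-block a-text-normal"})
]

# The whole string 'features="html.parser"' is not a parser name, so the
# lookup fails. The parser name on its own ("html.parser") is what works.
try:
    BeautifulSoup(SAMPLE_HTML, 'features="html.parser"')
    feature_error = None
except FeatureNotFound as exc:
    feature_error = exc

if __name__ == "__main__":
    print(missing_method)
    print(hrefs)
    print(type(feature_error).__name__)
```

Note that the class filter here matches because the class string is identical to the one in the markup, in the same order; on a live page the classes may differ.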
Sameh Farouk
0

If you try to scrape Amazon right now with requests, you won't get anything useful in return, since Amazon will detect that it's a script, and headers won't help (as far as I know).

Instead, the response will contain the following message:

To discuss automated access to Amazon data please contact api-services-support@amazon.com.
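
One way to notice this early is to check the downloaded text for that notice before parsing it. The sketch below is an assumption-heavy illustration: BLOCK_MARKER and looks_blocked are hypothetical names, and matching on this exact wording is a guess based on the message quoted above, not a documented interface.

```python
# Opening words of the notice quoted above. Matching on this phrase is an
# assumption: the wording may change or differ between regions.
BLOCK_MARKER = "To discuss automated access to Amazon data"


def looks_blocked(html):
    """Return True if the page text contains the automated-access notice."""
    return BLOCK_MARKER in html


if __name__ == "__main__":
    blocked_page = (
        "<!-- To discuss automated access to Amazon data "
        "please contact api-services-support@amazon.com. -->"
    )
    print(looks_blocked(blocked_page))             # True
    print(looks_blocked("<html><a href='/x'>ok</a></html>"))  # False
```

In a script like the one in the question, this check could be applied to sourcecode.text before the text is handed to BeautifulSoup, so that an empty result is not mistaken for a wrong selector.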

You can scrape Amazon with requests-html or selenium by rendering the page.

A simple requests-html example that scrapes titles (results will be similar to what you see if you open the same link in an incognito tab):

from requests_html import HTMLSession

session = HTMLSession()
url = 'https://www.amazon.com/s?k=apple+watch+series+6+band'
r = session.get(url)
r.html.render(sleep=1, keep_page=True, scrolldown=1)

for container in r.html.find('.a-size-medium'):
    title = container.text
    print(f"Title: {title}")

Output:

Title: New Apple Watch Series 6 (GPS, 40mm) - (Product) RED - Aluminum Case with (Product) RED - Sport Band
Title: SUPCASE [Unicorn Beetle Pro] Designed for Apple Watch Series 6/SE/5/4 [44mm], Rugged Protective Case with Strap Bands(Black)
Title: Spigen Rugged Armor Pro Designed for Apple Watch Band with Case for 44mm Series 6/SE/5/4 - Charcoal Gray
Title: Highly rated and well-priced products
Title: Fitlink Stainless Steel Metal Band for Apple Watch 38/40/42/44mm Replacement Link Bracelet Band Compatible with Apple Watch Series 6 Apple Watch Series 5 Apple Watch Series 1/2/3/4 (Grey,42/44mm)
Title: TalkWorks Compatible for Apple Watch Band 42mm / 44mm Comfort Fit Mesh Loop Stainless Steel Adjustable Magnetic Strap for iWatch Series 6, 5, 4, 3, 2, 1, SE - Rose Gold
Title: COOYA Compatible for Apple Watch Band 44mm 42mm Women Men iWatch Wristband with Protective Rugged Case Sport Strap Adjustable Replacement Band Compatible with Apple Watch Series 6 SE 5 4 3 2, Clear
Title: Stainless Steel Metal Bands Compatible with Apple Watch Band 42mm 44mm, Gold Replacement Strap with Adapter+Case Cover Compatible with iWatch Series 6 5 4 3 2 1 SE Sport
Title: elago W2 Charger Stand Compatible with Apple Watch Series 6/SE/5/4/3/2/1 (44mm, 42mm, 40mm, 38mm), Durable Silicone, Compatible with Nightstand Mode (Black)
Title: Element Case Black Ops Watch Band for Apple Watch Series 4/5/6/SE, 44mm - Black (EMT-522-244A-01)
...
Dmitriy Zub