
I am trying to build a list of the links that are inside a product page.

I have multiple search pages from which I want to collect the product links; I am posting the code for just a single one.

import requests
from bs4 import BeautifulSoup

r = requests.get("https://funskoolindia.com/products.php?search=9723100")
soup = BeautifulSoup(r.content)
for a_tag in soup.find_all('a', class_='product-bg-panel', href=True):
    print('href: ', a_tag['href'])

This is what it should print: https://funskoolindia.com/product_inner_page.php?product_id=1113

james joyce
  • Possible duplicate of [BeautifulSoup getting href](https://stackoverflow.com/questions/5815747/beautifulsoup-getting-href) – m13op22 Aug 16 '19 at 14:26
  • Maybe [this](https://stackoverflow.com/questions/41745514/getting-the-href-of-a-tag-which-is-in-li)? – m13op22 Aug 16 '19 at 14:28

3 Answers


The site is dynamic, so you can use selenium:

from bs4 import BeautifulSoup as soup
from selenium import webdriver

d = webdriver.Chrome('/path/to/chromedriver')
d.get('https://funskoolindia.com/products.php?search=9723100')
# parse the rendered page source and collect the unique product links
results = [*{i.a['href'] for i in soup(d.page_source, 'html.parser').find_all('div', {'class': 'product-media light-bg'})}]

Output:

['product_inner_page.php?product_id=1113']
Ajax1234

Try this: `print('href: ', a_tag.get("href"))`, and add `features="lxml"` to the `BeautifulSoup` constructor.
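
For reference, applied to the snippet from the question that would look like the sketch below (it requires `lxml` to be installed; note that this alone still prints nothing here, since, as the other answers point out, the links are injected by JavaScript):

import requests
from bs4 import BeautifulSoup

r = requests.get("https://funskoolindia.com/products.php?search=9723100")
# an explicit parser avoids BeautifulSoup's "no parser was explicitly specified" warning
soup = BeautifulSoup(r.content, features="lxml")

for a_tag in soup.find_all('a', class_='product-bg-panel', href=True):
    # .get() returns None instead of raising KeyError when the attribute is missing
    print('href: ', a_tag.get('href'))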

chiko360

The data is loaded dynamically through JavaScript from a different URL. One solution is to use selenium, which executes the JavaScript and loads the links that way.

Another solution is to use the re module and query the data URL manually:

import re
import requests
from bs4 import BeautifulSoup

url = 'https://funskoolindia.com/products.php?search=9723100'
data_url = 'https://funskoolindia.com/admin/load_data.php'

data = {
    'page': '1',
    'sort_val': 'new',
    'product_view_val': 'grid',
    'show_list': '12',
    'brand_id': '',
    # the search page embeds a key in its Javascript; extract it and send it with the POST
    'checkboxKey': re.findall(r'var checkboxKey = "(.*?)";', requests.get(url).text)[0]}

soup = BeautifulSoup(requests.post(data_url, data=data).text, 'lxml')

for a in soup.select('#list-view .product-bg-panel > a[href]'):
    print('https://funskoolindia.com/' + a['href'])

Prints:

https://funskoolindia.com/product_inner_page.php?product_id=1113
Andrej Kesely
  • This works fine, but now I have to get the details of the product from the extracted URLs. I think those will be dynamic too, so what do I do? Will this `re` method work on the extracted links, or do I have to use selenium? – james joyce Aug 16 '19 at 14:56
  • @jamesjoyce You can experiment. `selenium` has its overhead, so it's slower than the `requests` + `re` method. I suggest looking at the Chrome/Firefox developer tools to see where the page loads its data from, and then using that URL with `requests`. – Andrej Kesely Aug 16 '19 at 14:59
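
A quick way to run that experiment (a sketch only, following the suggestion above; whether the product details actually appear depends on whether the inner page is rendered server-side, which is not confirmed here):

import requests
from bs4 import BeautifulSoup

# one of the links extracted above; fetch it without a browser and inspect the raw HTML
detail_url = 'https://funskoolindia.com/product_inner_page.php?product_id=1113'
soup = BeautifulSoup(requests.get(detail_url).text, 'lxml')
# if the product details show up in this dump, plain requests is enough;
# otherwise fall back to the developer-tools approach above (or selenium)
print(soup.get_text(' ', strip=True)[:500])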