
A website has its inner HTML built dynamically.

Beautiful Soup is not extracting the embedded HTML.

I need to extract the div element with class qwjRop.

For example, I am not able to extract "At this price good" from the div tag.

import requests
from bs4 import BeautifulSoup

url="https://www.flipkart.com/hp-pentium-quad-core-4-gb-1-tb-hdd-dos-15-be010tu-notebook/product-reviews/itmeprzhy4hs4akv?page1&pid=COMEPRZBAPXN2SNF"


def crawler(in_url):
    source_code = requests.get(in_url)
    plain_text = source_code.text
    soup = BeautifulSoup(plain_text, "html.parser")

    for name in soup.findAll('div', {'class': 'qwjRop'}):
        print(name.prettify())


crawler(url)
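
A quick way to confirm the review markup is missing from the raw HTML (a hypothetical diagnostic, not part of the original code):

# Diagnostic: check whether the review class name occurs anywhere in the
# downloaded HTML. It likely prints False, because the page builds the
# reviews with JavaScript (see the answers below).
plain_text = requests.get(url).text
print('qwjRop' in plain_text)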

2 Answers


The page is rendered with JavaScript; you can use Selenium to render it.

First install Selenium:

sudo pip3 install selenium

Then get a driver from https://sites.google.com/a/chromium.org/chromedriver/downloads. You can use a headless version of Chrome ("Chrome Canary") if you are on Windows or Mac.
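
If you go the headless route, here is a minimal sketch (assuming Selenium 3.8+ and a ChromeDriver binary on your PATH):

from selenium import webdriver

# Run Chrome without opening a visible browser window.
options = webdriver.ChromeOptions()
options.add_argument('--headless')
browser = webdriver.Chrome(options=options)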

import bs4 as bs
from selenium import webdriver

browser = webdriver.Chrome()
url = "https://www.flipkart.com/hp-pentium-quad-core-4-gb-1-tb-hdd-dos-15-be010tu-notebook/product-reviews/itmeprzhy4hs4akv?page1&pid=COMEPRZBAPXN2SNF"
browser.get(url)                   # let the browser run the page's JavaScript
html_source = browser.page_source  # grab the fully rendered HTML
browser.quit()

soup = bs.BeautifulSoup(html_source, "html.parser")
for name in soup.findAll('div', {'class': 'qwjRop'}):
    print(name.prettify())
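
Note that page_source is read as soon as get() returns, which can be before the JavaScript has finished rendering the reviews. A more robust sketch uses Selenium's standard explicit wait (the class name qwjRop is taken from the question) before grabbing the source:

from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# Wait up to 10 seconds for at least one review div to appear,
# then grab the fully rendered HTML.
WebDriverWait(browser, 10).until(
    EC.presence_of_element_located((By.CLASS_NAME, 'qwjRop'))
)
html_source = browser.page_source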

Or, for other non-Selenium methods, see my answer to Scraping Google Finance (BeautifulSoup).

Dan-Dev

Of course we can use Selenium, as mentioned in the answer above. Here I'd like to introduce another tool you can use with Scrapy: scrapy_splash, a plugin created by the Scrapy team. Install it with pip install scrapy_splash. The documentation is detailed; if you write something like the following, scrapy_splash will render the site for you:

import scrapy
import scrapy_splash as scrapys


class StaticsSpider(scrapy.Spider):
    name = 'statics'
    start_urls = [
        'https://stackoverflow.com/',
    ]

    def start_requests(self):
        # Route each start URL through Splash so its JavaScript is executed;
        # 'wait' gives the page half a second to finish rendering.
        for item in self.start_urls:
            yield scrapys.SplashRequest(
                item, callback=self.parse, args={'wait': 0.5})

    def parse(self, response):
        ...

The response will be the rendered website, and you can work with it in the same way as any other response in Scrapy.
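
Note that scrapy_splash talks to a separate Splash server, so the Scrapy project needs some wiring in settings.py as well. A minimal sketch following the scrapy-splash README (it assumes Splash is listening on localhost:8050, e.g. started with docker run -p 8050:8050 scrapinghub/splash):

# settings.py -- minimal scrapy-splash configuration from the README.
SPLASH_URL = 'http://localhost:8050'

DOWNLOADER_MIDDLEWARES = {
    'scrapy_splash.SplashCookiesMiddleware': 723,
    'scrapy_splash.SplashMiddleware': 725,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
}

SPIDER_MIDDLEWARES = {
    'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
}

DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'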

Forsworn