
A website has its inner HTML built dynamically.

Beautiful Soup is not extracting the embedded HTML.

I need to extract the div element with class qwjRop.

For example, I am not able to extract "At this price good" from the div tag.

import requests
from bs4 import BeautifulSoup

url="https://www.flipkart.com/hp-pentium-quad-core-4-gb-1-tb-hdd-dos-15-be010tu-notebook/product-reviews/itmeprzhy4hs4akv?page1&pid=COMEPRZBAPXN2SNF"


def crawler(in_url):
    source_code = requests.get(in_url)
    plain_text = source_code.text
    soup = BeautifulSoup(plain_text, "html.parser")

    for name in soup.findAll('div', {'class': 'qwjRop'}):
        print(name.prettify())


crawler(url)
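
A quick way to confirm the review markup is missing from the raw HTML (a hypothetical diagnostic, not part of the original code):

# Diagnostic: check whether the review class name occurs anywhere in the
# downloaded HTML. It likely prints False, because the page builds the
# reviews with JavaScript (see the answers below).
plain_text = requests.get(url).text
print('qwjRop' in plain_text)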

2 Answers


The page is rendered with JavaScript; you can use Selenium to render it.

First install Selenium:

sudo pip3 install selenium

Then get a driver from https://sites.google.com/a/chromium.org/chromedriver/downloads. You can use a headless version of Chrome ("Chrome Canary") if you are on Windows or Mac.
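
If you go the headless route, here is a minimal sketch (assuming Selenium 3.8+ and a ChromeDriver binary on your PATH):

from selenium import webdriver

# Run Chrome without opening a visible browser window.
options = webdriver.ChromeOptions()
options.add_argument('--headless')
browser = webdriver.Chrome(options=options)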

import bs4 as bs
from selenium import webdriver

browser = webdriver.Chrome()
url = "https://www.flipkart.com/hp-pentium-quad-core-4-gb-1-tb-hdd-dos-15-be010tu-notebook/product-reviews/itmeprzhy4hs4akv?page1&pid=COMEPRZBAPXN2SNF"
browser.get(url)                   # let the browser run the page's JavaScript
html_source = browser.page_source  # grab the fully rendered HTML
browser.quit()

soup = bs.BeautifulSoup(html_source, "html.parser")
for name in soup.findAll('div', {'class': 'qwjRop'}):
    print(name.prettify())
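
Note that page_source is read as soon as get() returns, which can be before the JavaScript has finished rendering the reviews. A more robust sketch uses Selenium's standard explicit wait (the class name qwjRop is taken from the question) before grabbing the source:

from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# Wait up to 10 seconds for at least one review div to appear,
# then grab the fully rendered HTML.
WebDriverWait(browser, 10).until(
    EC.presence_of_element_located((By.CLASS_NAME, 'qwjRop'))
)
html_source = browser.page_source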

Or, for other non-Selenium methods, see my answer to Scraping Google Finance (BeautifulSoup).

Dan-Dev

Of course we can use Selenium, as mentioned in the answer above. Here I'd like to introduce another tool you can use with Scrapy: scrapy_splash, a plugin created by the Scrapy team. Install it with pip install scrapy_splash. The documentation is detailed; if you write something like the following, scrapy_splash will render the site for you:

import scrapy
import scrapy_splash as scrapys


class StaticsSpider(scrapy.Spider):
    name = 'statics'
    start_urls = [
        'https://stackoverflow.com/',
    ]

    def start_requests(self):
        # Route each start URL through Splash so its JavaScript is executed;
        # 'wait' gives the page half a second to finish rendering.
        for item in self.start_urls:
            yield scrapys.SplashRequest(
                item, callback=self.parse, args={'wait': 0.5})

    def parse(self, response):
        ...

The response will be the rendered website, and you can work with it in the same way as any other response in Scrapy.
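
Note that scrapy_splash talks to a separate Splash server, so the Scrapy project needs some wiring in settings.py as well. A minimal sketch following the scrapy-splash README (it assumes Splash is listening on localhost:8050, e.g. started with docker run -p 8050:8050 scrapinghub/splash):

# settings.py -- minimal scrapy-splash configuration from the README.
SPLASH_URL = 'http://localhost:8050'

DOWNLOADER_MIDDLEWARES = {
    'scrapy_splash.SplashCookiesMiddleware': 723,
    'scrapy_splash.SplashMiddleware': 725,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
}

SPIDER_MIDDLEWARES = {
    'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
}

DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'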

Forsworn