
I am trying to fetch data from a website using Selenium WebDriver and Beautiful Soup. The segment of code below takes a long time to execute.

import time
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from bs4 import BeautifulSoup

time1 = time.time()
options = webdriver.ChromeOptions()
options.add_argument('headless')
driver = webdriver.Chrome(chrome_options=options)
driver.get("https://www.bseindia.com/")
elem = driver.find_element_by_id("suggestBoxEQ")
elem.clear()
elem.send_keys("538707")
elem.send_keys(Keys.RETURN)
print(driver.current_url)
html = driver.page_source
soup = BeautifulSoup(html, 'lxml')
driver.quit()
time2 = time.time()
print(time2 - time1)

It takes 13.876 seconds to execute this code. Is there any way to speed up the execution, or a faster method to fetch the data?

yadav
  • What data do you want to scrape? Entering the number redirects to this URL: https://www.bseindia.com/stock-share-price/rajasthan-cylinders--containers-ltd/rccl/538707/ – Keyur Potdar Apr 14 '18 at 17:40
  • 1
    I need to get "Trade Date, Quantity Traded, Deliverable Quantity, % of Deliverable Quantity to Traded Quantity, market cap, security ID " etc for a company listed in BSE. TO achieve this I am getting dynamic content of the page using selenium and the processing it using beautifulsoup. – yadav Apr 14 '18 at 17:45

1 Answer


After entering 538707 on https://www.bseindia.com/, the page redirects to https://www.bseindia.com/stock-share-price/rajasthan-cylinders--containers-ltd/rccl/538707/.

The tables and other data on this page are loaded via AJAX requests, so you can fetch the data from those requests directly instead of rendering the page. To see the AJAX requests, go to the XHR tab under the Network tab in Developer Tools and refresh the page. You can get the data from any of the XHR requests listed there.

For example, the table Securitywise Delivery Position is loaded from the SecurityPosition.aspx URL used below. So, you can get the table directly like this:

import requests
from bs4 import BeautifulSoup

r = requests.get('https://www.bseindia.com/stock-share-price/SiteCache/SecurityPosition.aspx?Type=EQ&text=538707')
soup = BeautifulSoup(r.text, 'lxml')
table = soup.find('table')

Scrape this table using BeautifulSoup. For example:

print(table.find('td', class_='newseoscripfig').text)
# 13 Apr 2018
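To pull out the whole table rather than a single cell, a minimal sketch like the following works; the helper assumes the first <table> in the response holds the data, and the sample HTML here is a hypothetical stand-in for the real response (html.parser is used so the sketch does not depend on lxml):

```python
from bs4 import BeautifulSoup

def table_to_rows(html):
    """Return every row of the first <table> as a list of cell strings."""
    soup = BeautifulSoup(html, 'html.parser')
    table = soup.find('table')
    return [
        [cell.get_text(strip=True) for cell in tr.find_all(['th', 'td'])]
        for tr in table.find_all('tr')
    ]

# Hypothetical sample standing in for r.text from the XHR request:
sample = ('<table><tr><th>Trade Date</th></tr>'
          '<tr><td class="newseoscripfig">13 Apr 2018</td></tr></table>')
print(table_to_rows(sample))
# [['Trade Date'], ['13 Apr 2018']]
```

From there each row is a plain list of strings, easy to write to CSV or load into pandas.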

Similarly, you can find nearly all the data that is loaded dynamically in other XHR requests. As Selenium is not used here, the script is pretty fast.

Keyur Potdar
  • Keyur, I am unable to get data for some tables like "Stock Trading", "Market Depth" using XHR. https://www.bseindia.com/stock-share-price/SiteCache/Stock_Trading.aspx?text=538707&type=EQ , https://www.bseindia.com/stock-share-price/SiteCache/Stock_Trading.aspx?text=538707&type=EQ – yadav Apr 15 '18 at 15:49
  • It seems that [this url](https://www.bseindia.com/stock-share-price/SiteCache/Stock_Trading.aspx?text=538707&type=EQ) needs headers to make the request. Use the headers that are shown in the *Request Headers*. [This Q/A on SO](https://stackoverflow.com/questions/6260457/using-headers-with-the-python-requests-librarys-get-method) might help you to do that. [This too](https://stackoverflow.com/questions/8685790/adding-header-to-python-request-module/36634227#36634227). – Keyur Potdar Apr 15 '18 at 17:26
  • In this answer I've shown you the general idea to scrape all the dynamic elements. If you have any more problems regarding a specific XHR, I think you should ask a new question for that. If I add those things in this answer, it'll become a bit broad and out of scope. – Keyur Potdar Apr 15 '18 at 17:32
  • Sure, I will ask a new question. XHR is really fast compared to selenium. Thanks – yadav Apr 15 '18 at 17:39
  • Yes, I always send headers along with requests. I don't want to get myself blocked :P – yadav Apr 15 '18 at 17:47
  • No, it didn't work. If it requires headers then it should get the data in a browser, since the browser will be sending the header information. – yadav Apr 15 '18 at 17:55
  • 1
    I got the response after I added the *Referer* header. And no, if you copy and open the link on another tab, you won't see the data if other headers like referer, origin, etc are required. – Keyur Potdar Apr 15 '18 at 17:56
  • `r = requests.get('https://www.bseindia.com/stock-share-price/SiteCache/Stock_Trading.aspx?text=538707&type=EQ', headers=header); soup = BeautifulSoup(r.text, 'lxml'); print(soup)` – I am using a fake user agent. It returns no data. – yadav Apr 15 '18 at 17:59
  • I am getting data in r, meaning the server is sending data, but the soup object is empty. – yadav Apr 15 '18 at 18:03
  • Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/169035/discussion-between-keyur-potdar-and-yadav). – Keyur Potdar Apr 15 '18 at 18:13
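The Referer fix discussed in the comments above can be sketched as follows; the header values here are assumptions, so copy the real ones from the Request Headers panel in Developer Tools. Building the request with requests.Request lets you inspect the headers before actually sending it:

```python
import requests

def build_request(url, referer='https://www.bseindia.com/'):
    """Prepare (but do not send) a GET with a Referer header attached.

    The Referer and User-Agent values are placeholders; replace them
    with the headers your browser actually sends for this XHR.
    """
    headers = {'Referer': referer, 'User-Agent': 'Mozilla/5.0'}
    return requests.Request('GET', url, headers=headers).prepare()

req = build_request('https://www.bseindia.com/stock-share-price/SiteCache/Stock_Trading.aspx?text=538707&type=EQ')
print(req.headers['Referer'])
# https://www.bseindia.com/
```

To actually fetch the page, send the prepared request with `requests.Session().send(req)`, or simply call `requests.get(url, headers=headers)` with the same headers dict.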