1

I am trying to learn data scraping using Python and have been using the Requests and BeautifulSoup4 libraries. They work well for normal HTML websites, but when I tried to get data out of websites where the data loads after a delay, I got an empty value. An example:

from bs4 import BeautifulSoup
from selenium import webdriver

url = "https://www.example.com/;1"
browser = webdriver.PhantomJS()
browser.get(url)
html = browser.page_source   # captured immediately, before the delayed data loads
soup = BeautifulSoup(html, 'lxml')
a = soup.find('span', 'buy')
print(a)

I am trying to grab this value from the page: (value)

I have already referred to a similar topic and tried executing my code along the lines of the solution provided there, but somehow it doesn't seem to work. I am a novice here, so I need help getting this to work: How to scrape html table only after data loads using Python Requests?

The table content is probably generated by JavaScript and thus can't be "seen" by Requests. I am using Python 3.6 / PhantomJS / Selenium, as proposed by a lot of answers here.

fazal
  • You can use some of this http://selenium-python.readthedocs.io/waits.html or just add `time.sleep(n)` – amarynets Oct 04 '17 at 20:43
  • can you please check the url? It seems the ; there is a typo and might be causing the error in your scraper – jabargas Oct 04 '17 at 20:45
  • @AndMar time.sleep doesn't seem to work in this case. Please suggest where exactly you propose for me to add it? – fazal Oct 04 '17 at 20:57
  • @jabargas the same code works if I just change soup.find('span', 'buy') to soup.find('span', 'btc'), which is static content instead of the dynamic content that loads a few seconds after the page does. So I doubt there is any issue with the url. – fazal Oct 04 '17 at 20:59
  • [Try `browser.implicitly_wait(n)` where `n` is an integer for the amount of seconds.](http://selenium-python.readthedocs.io/waits.html#implicit-waits) – Luke Oct 04 '17 at 21:00
  • @user2833726 between `browser.get(url)` and `html = browser.page_source` – amarynets Oct 04 '17 at 21:01
  • can you please tell me what is the content that you want to get? Cause when you use your regular browser and go to https://www.zebpay.com/;1, you'll get a Page Not Found Error. – jabargas Oct 04 '17 at 21:04
  • @LukaszSalitra where should I add the wait? I added it after browser.get(url) and it doesn't work – fazal Oct 04 '17 at 21:04
  • @jabargas just visit zebpay.com; I am trying to get the value 289,162 in the html. The reason I used zebpay.com;1 was to follow the stackoverflow link that I was referring to. – fazal Oct 04 '17 at 21:07
  • @AndMar I tried adding sleep as proposed. No luck. – fazal Oct 04 '17 at 21:15
  • @user2833726 Have you tried using Chrome instead of PhantomJS? – amarynets Oct 04 '17 at 21:17
  • @AndMar it doesn't make any difference. – fazal Oct 07 '17 at 16:16

2 Answers

2

You have to run a real browser to scrape content that loads after a delay; use Selenium with an explicit wait. Here is sample code using Chrome as the driver:

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

url = "https://www.example.com/"
browser = webdriver.Chrome(<chromedriver path here>)
browser.set_window_size(1120, 550)
browser.get(url)
# wait up to 3 seconds for the element to appear in the DOM
element = WebDriverWait(browser, 3).until(
    EC.presence_of_element_located((By.ID, "blabla"))
)
data = element.get_attribute('data-blabla')
print(data)
browser.quit()
songxunzhao
  • it works like a charm using the chrome webdriver. But it actually opens up a browser window. Is there something similar with a headless browser? Maybe if you have similar code for phantomjs or something that doesn't open a physical browser but works under the hood, like a console window? Thanks again. Once I get your response I will mark this post as answered. – fazal Oct 20 '17 at 09:29
  • Please replace webdriver.Chrome() with webdriver.PhantomJS(). All other steps are the same. – songxunzhao Oct 22 '17 at 02:47
0

You can access the desired values by requesting them directly from the API and analyzing the JSON response.

import requests

res = requests.get('https://api.example.com/api/')
d = res.json()   # decode the JSON response body

print(d['market'])
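For reference, this is what decoding such a response looks like; the payload below is made up to mirror a ticker-style API, and the field names are assumptions:

```python
import json

# a made-up body mirroring what a ticker API might return
payload = '{"market": "289162", "buy": "289500", "sell": "288800"}'

d = json.loads(payload)   # requests does the same thing internally via res.json()
print(d["market"])        # -> 289162
```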
Roman Mindlin
    Thanks for the response. Although an API will do for this site, the original idea was to understand how to scrape the value on a site where there is a slight delay in data loading. That is the key ask of this post. – fazal Oct 07 '17 at 16:15