I'm trying to use Selenium with chromedriver for pages like this one: http://shironet.mako.co.il/artist?type=lyrics&lang=1&prfid=202&wrkid=2473

The problem is that Selenium always waits until the page has finished loading (for example, the YouTube player on that page). I'm only interested in the HTML source, so I don't want to wait that long. How can I make my program not wait? I'm using Python. (I'm using Selenium because urllib didn't work for this website.)

Andersson
Yoav Cohen
  • What code are you using to open the web page? Generally, the driver.get() method waits until the page has loaded. – thebadguy Dec 05 '16 at 10:33
  • Hi, I'm using driver.get(url) and then driver.page_source, but I don't want to wait until the page has loaded; I only want the source code. – Yoav Cohen Dec 05 '16 at 10:36

3 Answers

If you only want the source code, you don't actually need anything Selenium does, so Selenium will only get in your way. Use Selenium only to obtain the URL, and then do a simple HTTP GET (e.g. with curl or wget, or with a built-in or library function in your programming language, such as urllib2 or requests in Python).

If you still want to do some complex parsing of the HTML, look at BeautifulSoup or LXML.
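
As a rough illustration of that parsing step, here is a minimal sketch that pulls the text out of elements with a given CSS class. It uses Python's built-in `html.parser` rather than BeautifulSoup or lxml so that it runs without extra packages. The class name `lyrics` and the sample markup are hypothetical, not taken from the site in the question.

```python
from html.parser import HTMLParser

# Tags that never have a closing tag, so they must not change the nesting depth.
VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}


class ClassTextExtractor(HTMLParser):
    """Collect the text found inside every element that has a given CSS class."""

    def __init__(self, css_class):
        super().__init__()
        self.css_class = css_class
        self.depth = 0    # greater than 0 while inside a matching element
        self.chunks = []  # stripped text fragments, in document order

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        classes = (dict(attrs).get("class") or "").split()
        if self.depth or self.css_class in classes:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag in VOID_TAGS:
            return
        if self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth and data.strip():
            self.chunks.append(data.strip())


def extract_text(html, css_class):
    """Return the text fragments inside elements carrying css_class."""
    parser = ClassTextExtractor(css_class)
    parser.feed(html)
    parser.close()
    return parser.chunks


# Hypothetical markup standing in for driver.page_source or an HTTP response body.
sample = '<div><span class="lyrics">line one<br>line two</span><p>other</p></div>'
lines = extract_text(sample, "lyrics")
print(lines)
```

With BeautifulSoup installed, the same idea is usually much shorter; the hand-written parser above only keeps the sketch dependency-free.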

TimoV
  • I wish a simple HTTP GET request worked, but it doesn't, because I think the site is protected with some JavaScript. http://stackoverflow.com/questions/40710396/requesting-web-page-with-python – Yoav Cohen Dec 05 '16 at 11:32
  • Protected using authentication? Or protected from scraping? If it is protected from scraping, you should wait until it is loaded entirely. That way you are sure that any scrambling that was going on is resolved -> use Selenium, wait for the page to be loaded, look at the source. Side note: you can disable certain plugins in your driver settings. For example, I use `preferences.put("plugins.plugins_disabled", new String[]{ "Adobe Flash Player", "Chrome PDF Viewer"});` a lot. Similar options exist for most drivers and most languages. – TimoV Dec 05 '16 at 14:22

There are a few possible solutions:

1) Since you didn't clarify what you mean by "urllib didn't work for this website", you can try the python-requests library instead:

Install it by running pip install requests in cmd/Terminal, then:

import requests

url = "http://shironet.mako.co.il/artist?type=lyrics&lang=1&prfid=202&wrkid=2473"
page_source = requests.get(url).content  # raw response body as bytes

2) Try disabling automatic media playback through Firefox preferences:

from selenium import webdriver
from selenium.webdriver.firefox.firefox_profile import FirefoxProfile

profile = FirefoxProfile()  # create the profile that will carry the preference
profile.set_preference("media.autoplay.enabled", False)
driver = webdriver.Firefox(profile)

3) A cruder method is to disable JavaScript on the page (I'm not sure you actually need this for the purpose you describe):

from selenium import webdriver
from selenium.webdriver.firefox.firefox_profile import FirefoxProfile

profile = FirefoxProfile()  # create the profile that will carry the preference
profile.set_preference("javascript.enabled", False)
driver = webdriver.Firefox(profile)

Be careful, though: this can also remove required content (such as media files) from the page source.
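
For option 1, one thing worth ruling out is that the server rejects requests that do not look like they come from a browser. Whether that is what blocks urllib on this site is an assumption; the asker suspects JavaScript protection, which headers alone would not solve. The sketch below uses the standard library's `urllib.request` so it runs without extra packages. The header values and the helper names `build_request` and `fetch_html` are illustrative, not taken from the original posts.

```python
import urllib.request

# Illustrative browser-like headers; the exact values are an assumption.
BROWSER_HEADERS = {
    "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
                  "(KHTML, like Gecko) Chrome/120.0 Safari/537.36",
    "Accept-Language": "he,en;q=0.8",
}


def build_request(url, headers=None):
    """Build a GET request that carries browser-like headers."""
    return urllib.request.Request(url, headers=headers or BROWSER_HEADERS)


def fetch_html(url, timeout=10):
    """Fetch url and return the decoded response body (performs a network call)."""
    with urllib.request.urlopen(build_request(url), timeout=timeout) as resp:
        charset = resp.headers.get_content_charset() or "utf-8"
        return resp.read().decode(charset, errors="replace")


# Building the request does not touch the network; only fetch_html() does.
request = build_request("http://example.com/")
print(request.header_items())
```

With the requests library, the equivalent is to pass the same dictionary as `requests.get(url, headers=BROWSER_HEADERS)`. If the response still lacks the content you need, the page is probably built by scripts, and a real browser (Selenium) remains the practical route.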

Andersson

I know this was asked long ago and you probably don't need help anymore, but I was facing a similar problem and found a solution. It's not the most sophisticated, but it works fine. Try setting a page load timeout so you don't have to wait for the page to load fully, like this:

from selenium import webdriver
from selenium.common.exceptions import TimeoutException

link = "https://somewebsite.com"
timeout = 30  # seconds; read the note below

driver = webdriver.Chrome()
driver.set_page_load_timeout(timeout)
try:
    driver.get(link)
except TimeoutException:
    # The timeout we set raises an exception when the time runs out,
    # so we handle it and stop whatever is still loading.
    driver.execute_script("window.stop();")
    print("Timed out; stopped loading, no need to wait any longer.")

html = driver.page_source  # the HTML loaded so far

IMPORTANT: The timeout needs adjusting. You will have to test how long it takes for the content you want to arrive before loading is stopped; just change the timeout variable until it works the way you want.

I couldn't find anything that does this "automatically", though, which is what I wanted and what you probably wanted too.
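
One option that may give the "automatic" behavior is the WebDriver `pageLoadStrategy` capability, whose standard values are `normal`, `eager` and `none`. With `eager`, `driver.get()` is meant to return once the DOM is ready instead of waiting for every image, frame and other subresource. The sketch below sets the strategy on a Selenium-style options object. The helper name `with_page_load_strategy` is hypothetical, and a stand-in object is used so the block runs without a browser; whether `eager` is fast enough for the page in the question is untested.

```python
from types import SimpleNamespace

# Values defined for the WebDriver "pageLoadStrategy" capability.
PAGE_LOAD_STRATEGIES = ("normal", "eager", "none")


def with_page_load_strategy(options, strategy="eager"):
    """Set the page load strategy on a Selenium-style options object."""
    if strategy not in PAGE_LOAD_STRATEGIES:
        raise ValueError(f"unknown page load strategy: {strategy!r}")
    options.page_load_strategy = strategy
    return options


# Stand-in object so the sketch runs without Selenium or a browser.
# With Selenium 4 installed, the real usage would look roughly like this:
#
#   from selenium import webdriver
#   options = with_page_load_strategy(webdriver.ChromeOptions(), "eager")
#   driver = webdriver.Chrome(options=options)
#   driver.get(link)            # should return without waiting for the full load
#   html = driver.page_source
demo_options = with_page_load_strategy(SimpleNamespace())
print(demo_options.page_load_strategy)
```

With `none`, `driver.get()` returns almost immediately, so you would have to wait explicitly for the element you need before reading `driver.page_source`. The timeout approach above still works as a fallback for drivers that ignore the setting.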

Farlitz