19

I'm trying to log in to Quora using Python and then scrape some content.

I'm using Selenium to log in to the site. Here's my code:

from selenium import webdriver
from selenium.webdriver.common.keys import Keys

driver = webdriver.Firefox()
driver.get('http://www.quora.com/')

username = driver.find_element_by_name('email')
password = driver.find_element_by_name('password')

username.send_keys('email')
password.send_keys('password')
password.send_keys(Keys.RETURN)

driver.close()

Now the questions:

  1. It took ~4 minutes to find and fill the login form, which is painfully slow. Is there anything I can do to speed up the process?

  2. Once it has logged in, how do I make sure there were no errors? In other words, how do I check the response code?

  3. How do I save cookies with Selenium so I can continue scraping after I log in?

  4. If there is no way to make Selenium faster, is there an alternative for logging in? (Quora doesn't have an API.)

Oleksandr Makarenko
KGo
  • 2
    Which lines are taking the time? – Vince Bowdren Jul 04 '13 at 09:53
  • @vincebowdren Almost all of them. The browser opens up just fine, but then finding the fields, and filling them takes about a minute each. – KGo Jul 04 '13 at 10:30
  • @user1177636 Yes. Works just fine on Google. Must be an issue with quora. – KGo Jul 04 '13 at 10:38
  • Using Quora and the latest Selenium C# API, it is fast for me. – Arran Jul 04 '13 at 11:16
  • 1
    How fast? Because I've tried on 3 machines with the Python API and it's so damn slow. – KGo Jul 04 '13 at 11:42
  • @Arran: I can still reproduce it with Firefox + Python/C#2.33.0 bindings. `driver.Navigate().GoToUrl("http://www.quora.com/physics");Thread.Sleep(3000);var source = WebDriver.PageSource;` Will get exception. – Yi Zeng Jul 04 '13 at 22:04
  • @KaranGoel have you managed to scrape something from Quora with python? – Stanpol Nov 24 '13 at 21:51
  • @Stanpol I can scrape the public content (question title, top answer) but everything else is a mess. – KGo Nov 25 '13 at 04:38
  • I'm having the same problem with Quora. Selenium takes forever to do simple tasks. I guess they don't want us to scrape them. – Laraconda Jul 05 '16 at 23:40
  • I tried headless chrome driver, however, `driver.get('http://www.quora.com/')` still runs slowly and it takes nearly 4 minutes. @KaranGoel Have you found a solution to speed up the process? Thanks! – mathsyouth Jul 19 '18 at 03:52

7 Answers

20

I had a similar problem with very slow find_elements_xxx calls in Python selenium using the ChromeDriver. I eventually tracked down the trouble to a driver.implicitly_wait() call I made prior to my find_element_xxx() calls; when I took it out, my find_element_xxx() calls ran quickly.

I know those elements were present when I made the find_elements_xxx() calls, so I cannot explain why implicitly_wait() should have affected the speed of those operations, but it did.
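To illustrate the alternative that the comments mention (an explicit wait instead of a global implicit one), here is a hand-rolled sketch of the polling idea. It is not Selenium's own WebDriverWait, only a simplified stand-in; the name `wait_until` and its parameters are made up for this example.

```python
import time


def wait_until(condition, timeout=10.0, poll_interval=0.5):
    """Poll `condition` until it returns a truthy value or `timeout` seconds pass.

    Hypothetical helper: a simplified stand-in for an explicit wait,
    not part of the Selenium API.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            # Return as soon as the condition holds; no extra waiting.
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} seconds")
        time.sleep(poll_interval)
```

With a driver in hand, a call might look like `wait_until(lambda: driver.find_elements_by_name('email'))`, which returns the (non-empty) list of matches as soon as one appears. The wait applies only to that one lookup, so other calls are not affected by a global timeout.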

Polly
  • 1
    That really helped me, as I switched to the WebDriverWait method and completely forgot about this call. Thanks! – rak007 Jul 11 '17 at 23:07
  • omg i put that in to test a while ago and forgot it was in there. i've been wondering why it's been taking so goddamn long to run. ty so much <33333 – oldboy Jun 28 '18 at 03:23
  • In Protractor, I had a page that was suddenly taking several minutes to interact with, probably due to JavaScript exceptions in code that another team pushed... I had been using browser.driver.manage().timeouts().implicitlyWait(10000) in my onPrepare without issues, but when I took it out, now Selenium was able to interact with this page efficiently. Thank you @Polly!!! – emery Sep 24 '19 at 15:56
3
  1. I have been there: Selenium is slow, although it does not usually take as long as 4 minutes to fill a form. I switched to PhantomJS, which is much faster than Firefox because it is headless. After installing the latest PhantomJS, you can simply replace Firefox() with PhantomJS() in the webdriver line.

  2. To check that you have logged in, assert on some element that is displayed only after login.

  3. As long as you do not quit your driver, the cookies remain available for following links.

  4. You can try using urllib to POST directly to the login URL, with cookiejar to save the cookies. You can even save the cookie yourself; after all, a cookie is just a string in an HTTP header.
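Point 4 could look roughly like the sketch below, using only the standard library. The login URL and the form field names (`email`, `password`) are assumptions for illustration; a real site may also require hidden tokens or JavaScript, so there is no guarantee this works against Quora.

```python
import http.cookiejar
import urllib.parse
import urllib.request

# Hypothetical endpoint; replace with the site's real login URL.
LOGIN_URL = "https://www.example.com/login"


def build_opener(cookie_file):
    """Return an opener that stores cookies in a jar backed by `cookie_file`."""
    jar = http.cookiejar.LWPCookieJar(cookie_file)
    opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))
    return opener, jar


def encode_login_form(email, password):
    """URL-encode the form fields (field names are assumed) as POST data."""
    return urllib.parse.urlencode({"email": email, "password": password}).encode("utf-8")


def log_in(opener, jar, email, password):
    """POST the credentials, save the cookies to disk, and return the HTTP status."""
    # Passing `data` makes urllib send a POST request.
    with opener.open(LOGIN_URL, data=encode_login_form(email, password)) as response:
        status = response.status
    # ignore_discard=True also keeps session cookies that have no expiry date.
    jar.save(ignore_discard=True)
    return status
```

The returned status also addresses the question about response codes: urllib raises `urllib.error.HTTPError` for 4xx and 5xx responses, and `response.status` gives the code otherwise.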

manish
  • 1
    1. PhantomJS is much faster for sure (I still took 38 seconds). But I want to be able to see what the script is doing in the browser before I switch to headless browser. 2. `assert "Home" in driver.title` gave me `AssertionError`. 4. I can try that for sure. – KGo Jul 04 '13 at 10:56
  • Install latest version of phantomjs available through their website, not apt-get. version should be 1.9.1 – manish Jul 05 '13 at 05:23
  • Yes that's what I did. Downloaded the latest from their website, placed it in the same folder as my program and got this error. The file I downloaded was `phantonjs` (no extension) – KGo Jul 05 '13 at 06:58
  • Karan, you have to place it in a folder that is there in $PATH variable. It may not work if you place it in current folder unless your $PATH includes . – manish Jul 18 '13 at 06:02
3

You can speed up form filling by using your own setAttribute method. Here is the Java code for it:

public void setAttribute(By locator, String attribute, String value) {
    ((JavascriptExecutor) getDriver()).executeScript("arguments[0].setAttribute('" + attribute
            + "',arguments[1]);",
            getElement(locator),
            value);
}
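A rough Python equivalent of the Java helper above might look like this. The function name `set_attribute` is made up for the example; `execute_script` is the Python binding's call for running JavaScript in the page.

```python
def set_attribute(driver, element, attribute, value):
    """Set an HTML attribute on `element` through injected JavaScript.

    Hypothetical helper mirroring the Java setAttribute method; `driver`
    is a Selenium WebDriver and `element` a WebElement it has located.
    """
    driver.execute_script(
        "arguments[0].setAttribute(arguments[1], arguments[2]);",
        element,
        attribute,
        value,
    )


# Possible usage, assuming `driver` already has the login page open:
#   field = driver.find_element_by_name('email')
#   set_attribute(driver, field, 'value', 'me@example.com')
```

Because this bypasses keyboard events entirely, pages that react to typing (validation, enabling a submit button) may not notice a value set this way.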
Stormy
  • Can you explain what this does and how it makes the script faster? – KGo Jul 06 '13 at 02:04
  • You can just execute setAttribute(FindBy*****(your locator here), "value", "Text you want to be put in the field); and it will set the HTML attribute "value" to text you want to fill in the field. Basically there is a timeout on the send_keys operation, my method bypasses this by doing in JS-injection into your page to assign your text to the field, this would be done very fast. – Stormy Jul 08 '13 at 09:52
  • omg super sweet!!! i'm assuming this would be accomplished with `browser.execute_script(' // javascript goes here ')` with python?? – oldboy Jun 28 '18 at 03:25
2

Running the web driver headlessly should improve its execution speed to some degree.

from selenium.webdriver import Firefox
from selenium.webdriver.firefox.options import Options

options = Options()
options.add_argument('-headless')
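# Use the imported Firefox class directly; `webdriver` itself is never imported above.
# Recent Selenium releases take `options=`; older ones used `firefox_options=`.
browser = Firefox(options=options)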

browser.get('https://google.com/')
browser.close()
oldboy
1

For Windows 7 and IEDRIVER with Python Selenium, closing the Windows command prompt and restarting it cured my issue.

I was having trouble with find_element...click() calls. Each one took a little over 30 seconds. Here is the kind of code I have, including the timing measurements:

import time

# clickDown is a CSS selector string defined elsewhere in the script
timeStamp = time.time()
driver.find_element_by_css_selector(clickDown).click()  # click() returns None, so there is nothing to keep
print("1 took:", time.time() - timeStamp)

timeStamp = time.time()
driver.find_element_by_id("cSelect32").click()
print("2 took:", time.time() - timeStamp)

That was recording about 31 seconds for each click. After ending the command line and restarting it (which does end any IEDRIVERSERVER.exe processes), it was 1 second per click.
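The same timing pattern can be wrapped in a small context manager so that each step is measured without repeating the bookkeeping. This is only a sketch; the name `timed` and the optional `results` dict are made up for the example.

```python
import time
from contextlib import contextmanager


@contextmanager
def timed(label, results=None):
    """Print how long the enclosed block took; optionally record it in `results`."""
    start = time.perf_counter()
    try:
        yield
    finally:
        # Runs even if the block raises, so failed steps are timed too.
        elapsed = time.perf_counter() - start
        if results is not None:
            results[label] = elapsed
        print(f"{label} took: {elapsed:.3f}s")


# Possible usage with a live driver:
#   with timed("click 2"):
#       driver.find_element_by_id("cSelect32").click()
```

Timing each call this way is also a quick way to answer "which lines are taking the time?" for the original question.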

1

I have changed locators and this works fast. Also, I have added working with cookies. Check the code below:

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.keys import Keys
import pickle


driver = webdriver.Firefox()
driver.get('http://www.quora.com/')
wait = WebDriverWait(driver, 5)
username = wait.until(EC.presence_of_element_located((By.XPATH, '//div[@class="login"]//input[@name="email"]')))
password = wait.until(EC.presence_of_element_located((By.XPATH, '//div[@class="login"]//input[@name="password"]')))

username.send_keys('email')
password.send_keys('password')
password.send_keys(Keys.RETURN)

wait.until(EC.presence_of_element_located((By.XPATH, '//span[text()="Add Question"]'))) # checking that user logged in
pickle.dump( driver.get_cookies() , open("cookies.pkl","wb")) # saving cookies
driver.close()

We have saved cookies and now we will apply them in a new browser:

driver = webdriver.Firefox()
driver.get('http://www.quora.com/')
cookies = pickle.load(open("cookies.pkl", "rb"))
for cookie in cookies:
    driver.add_cookie(cookie)
driver.get('http://www.quora.com/')
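As a variation on the pickle approach above, the cookies can also be stored as JSON, which is human-readable and avoids unpickling untrusted files. The helper names below are made up for this sketch; `get_cookies()` and `add_cookie()` are the same driver methods used above.

```python
import json


def save_cookies(driver, path):
    """Write the driver's current cookies to `path` as JSON."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(driver.get_cookies(), f)


def load_cookies(driver, path):
    """Add the cookies stored in `path` to the driver; return how many were added."""
    with open(path, encoding="utf-8") as f:
        cookies = json.load(f)
    for cookie in cookies:
        driver.add_cookie(cookie)
    return len(cookies)
```

As in the code above, the driver should already be on the site's domain (via `driver.get(...)`) before `load_cookies` is called, and the page needs to be loaded again afterwards for the cookies to take effect.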

Hope this helps.

Oleksandr Makarenko
0

If driver.get() is very slow, this is the fastest alternative: it takes the cookies and session from the webdriver and uses them with requests to make GET requests, which is much faster than going through the webdriver.

import requests
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
options = webdriver.ChromeOptions()
options.add_argument("headless")  # to stop opening new chrome browser on every hit
driver = webdriver.Chrome('chromedriver.exe', chrome_options=options)  # download chromedriver and give the location

# ...
# ...
# This section might include extra webdriver steps such as:
# WebDriverWait(driver, 20).until(EC.element_to_be_clickable((By.ID, 'nav-global-location-popover-link'))).click()
# ...
# ...

#creating requests session
s = requests.Session()
# Set correct user agent
selenium_user_agent = driver.execute_script("return navigator.userAgent;")
s.headers.update({"user-agent": selenium_user_agent})

#setting cookies of webdriver to requests
for cookie in driver.get_cookies():
    s.cookies.set(cookie['name'], cookie['value'], domain=cookie['domain'])

# GET request (much faster than going through the webdriver)
response = s.get('https://www.amazon.com/dp/B07F2LR8NX')

bs = BeautifulSoup(response.content, 'html.parser')

This requests.Session().get() call is much faster than driver.get().
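Going through requests also makes it easy to check the response code, which the question asks about. A minimal sketch, assuming `session` is a requests.Session set up as above (the function name `fetch_checked` is made up):

```python
def fetch_checked(session, url, timeout=30):
    """GET `url` with the given session and fail loudly on an error status."""
    response = session.get(url, timeout=timeout)
    # For a requests response, this raises requests.HTTPError on 4xx/5xx codes.
    response.raise_for_status()
    print("status:", response.status_code)
    return response


# Possible usage with the session `s` built above:
#   response = fetch_checked(s, 'https://www.amazon.com/dp/B07F2LR8NX')
```

A 200 status only says the request succeeded; to confirm the session is actually logged in, it still helps to look for some element in the parsed page that appears only after login.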

Prakash Dahal