optimize my python bank webscraper

Question

I am using Python 3.4 to write a web scraper that logs in to my bank account, clicks into each account, copies the balance, adds up the total and pastes it into Google Sheets.

I got it working, but as you can see from the code, it is repetitive, ugly and long-winded.

I have identified a few issues:

I believe I should be using a function to loop through the different account pages, get each balance and assign it to its own variable. However, I couldn't think of a way to do this.

Converting the string to a float seems messy. I am trying to turn a string such as $1,000.00 into a float by stripping the '$' and ','. Is there a more elegant way?

from selenium import webdriver 
import time
import bs4
import gspread
from oauth2client.service_account import ServiceAccountCredentials

driver = webdriver.Chrome()
driver.get(bank_url)  # bank_url: placeholder for the bank's login URL


inputElement = driver.find_element_by_id("dUsername")
inputElement.send_keys('username')
pwdElement = driver.find_element_by_id("password")
pwdElement.send_keys('password')
driver.find_element_by_id('loginBtn').click()
time.sleep(3)

#copies saving account balance
driver.find_element_by_link_text('Savings').click()
time.sleep(3)
html = driver.page_source
soup = bs4.BeautifulSoup(html)
elems=soup.select('#CurrentBalanceAmount')
SavingsAcc = float(elems[0].getText().strip('$').replace(',',''))
driver.back()

#copy cheque balance
driver.find_element_by_link_text('cheque').click()
time.sleep(3)
html = driver.page_source
soup = bs4.BeautifulSoup(html)
elems=soup.select('#CurrentBalanceAmount')
ChequeAcc = float(elems[0].getText().strip('$').replace(',',''))
Total = SavingsAcc + ChequeAcc
driver.back()

score 0 · Accepted Answer · edited May 23 '17 at 12:24

Try the following code:

from selenium import webdriver 
import time
import bs4
import gspread
from oauth2client.service_account import ServiceAccountCredentials

driver = webdriver.Chrome()
driver.get(bank_url)  # bank_url: placeholder for the bank's login URL


inputElement = driver.find_element_by_id("dUsername")
inputElement.send_keys('username')
pwdElement = driver.find_element_by_id("password")
pwdElement.send_keys('password')
driver.find_element_by_id('loginBtn').click()
time.sleep(3)

def getBalance(accountType):
    driver.find_element_by_link_text(accountType).click()
    time.sleep(3)
    html = driver.page_source
    soup = bs4.BeautifulSoup(html)
    elems=soup.select('#CurrentBalanceAmount')
    return float(elems[0].getText().strip('$').replace(',',''))

#copies saving account balance
SavingsAcc = getBalance('Savings')
driver.back()
#copy cheque balance
ChequeACC = getBalance('cheque')    

Total = SavingsAcc+ ChequeACC  
driver.back()

I added a function, getBalance, which takes the account type (the link text to click) and returns the balance amount.

Note: you can move the driver.back() call into getBalance if that is more convenient, but it must come before the return statement.

As for converting the string to a float, I don't know of a better way than your existing logic. Now that it lives in a single function, I hope it won't bother you as much. The built-in float() converts a string to a float, but it does not accept characters such as $ or ,.
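One possible way to tidy the conversion, sketched here with only the standard library: remove everything except digits, the decimal point and a minus sign with a regular expression, then hand the result to Decimal. The helper name parse_amount is made up for this example, and it assumes balances formatted like $1,000.00 (comma as thousands separator, dot as decimal point).

```python
import re
from decimal import Decimal


def parse_amount(text):
    """Convert a string such as '$1,000.00' to a Decimal.

    Assumes ',' is a thousands separator and '.' is the decimal point.
    """
    # Drop every character that is not a digit, a dot or a minus sign.
    cleaned = re.sub(r"[^0-9.\-]", "", text)
    return Decimal(cleaned)


savings = parse_amount("$1,000.00")
cheque = parse_amount(" $250.50 ")
total = savings + cheque
print(total)  # 1250.50
```

Decimal avoids binary floating-point rounding when adding money. If a float is really needed, float(parse_amount(...)) works as well.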

Note: if the #CurrentBalanceAmount selector differs between account types, you can pass it in as a parameter, just like accountType.
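A rough sketch of that parameterization: keep a mapping from each account's link text to the selector of its balance element, and loop over it. The names total_balance, fake_fetch and FAKE_PAGES are invented for this example. fetch_balance stands in for a Selenium-backed function such as getBalance; a fake is passed in here so the sketch runs without a browser.

```python
def total_balance(accounts, fetch_balance):
    """Collect the balance of every account and their sum.

    accounts      -- dict mapping link text to the CSS selector of its balance
    fetch_balance -- callable(link_text, selector) -> float; in the real
                     script this would click the link, read the element
                     and call driver.back()
    """
    balances = {}
    for link_text, selector in accounts.items():
        balances[link_text] = fetch_balance(link_text, selector)
    return balances, sum(balances.values())


# Stand-in for the browser so the sketch can be run on its own (made-up data).
FAKE_PAGES = {"Savings": "$1,000.00", "cheque": "$250.50"}


def fake_fetch(link_text, selector):
    raw = FAKE_PAGES[link_text]
    return float(raw.strip().lstrip("$").replace(",", ""))


accounts = {
    "Savings": "#CurrentBalanceAmount",
    "cheque": "#CurrentBalanceAmount",
}
balances, total = total_balance(accounts, fake_fetch)
print(balances, total)
```

Adding a third account then only means adding one entry to the accounts dict.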

score 0 · Answer 2 · edited May 23 '17 at 12:01

I would use several Python idioms to clean up the code:

Wrap all code in functions
- Generally speaking, putting your code in functions makes it easier to read and follow
- When you run a Python script (python foo.py), the interpreter executes the file's statements in order, one by one. When it encounters a function definition, it only runs the definition line (def bar():), and not the code within the function; that code runs only when the function is called.
- This article seems like a good place to get more info on it: Understanding Python's Execution Model
Use the if __name__ == "__main__": idiom to make it an importable module
- Similar to the above bullet, this gives you more control on how and when your code executes, how portable it is, and how reusable it is.
- "Importable module" means you can write your code in one file, and then import that code in another module.
- More info on if __name__ == "__main__" here: What does if __name__ == "__main__": do?
Use try/finally to make sure your driver instances get cleaned up
Use explicit waits to interact with the page so you don't need to use sleep
- By default, Selenium tries to find and return things immediately. If the element hasn't loaded yet, Selenium throws an exception because it isn't smart enough to wait for it to load.
- Explicit waits are built into Selenium, and allow your code to wait for an element to load into the page. By default it checks every half a second to see if the element loaded in. If it hasn't, it simply tries again in another half second. If it has, it returns the element. If it doesn't ever load in, the Wait object throws a TimeoutException.
- More here: Explicit and Implicit Waits
- And here: WAIT IN SELENIUM PYTHON
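To make the polling behaviour described above concrete, here is a simplified, standard-library-only model of what an explicit wait does. It is not Selenium's implementation: wait_until, WaitTimeout and element_loaded are names invented for this sketch. In Selenium itself the equivalent is WebDriverWait(driver, timeout).until(...), usually combined with expected_conditions.

```python
import time


class WaitTimeout(Exception):
    """Raised when the condition never becomes true (stand-in for TimeoutException)."""


def wait_until(condition, timeout=10.0, poll=0.5):
    """Call condition() every `poll` seconds until it returns a truthy value."""
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise WaitTimeout("condition not met within %s seconds" % timeout)
        time.sleep(poll)


# Simulate an element that only "appears" on the third check.
attempts = {"count": 0}


def element_loaded():
    attempts["count"] += 1
    return "balance element" if attempts["count"] >= 3 else None


print(wait_until(element_loaded, timeout=2.0, poll=0.01))
```

Unlike time.sleep(3), this returns as soon as the condition holds and fails loudly if it never does.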

Code (untested for obvious reasons):

from selenium import webdriver
from explicit import waiter, ID  # This package makes explicit waits easier to use
                                 # pip install explicit
from selenium.webdriver.common.by import By

# Are any of these needed?
# import time
# import bs4
# import gspread
# from oauth2client.service_account import ServiceAccountCredentials


def bank_login(driver, username, password):
    """Log into the bank account"""
    waiter.find_write(driver, 'dUsername', username, by=ID)
    waiter.find_write(driver, 'password', password, by=ID, send_enter=True)


def get_amount(driver, source):
    """Click the page and scrape the amount"""
    # Click the page in question
    waiter.find_element(driver, source, by=By.LINK_TEXT).click()

    # Why are you using beautiful soup? Because it is faster?
    # time.sleep(3)
    # html = driver.page_source
    # soup = bs4.BeautifulSoup(html)
    # elems=soup.select('#CurrentBalanceAmount')
    # SavingsAcc = float(elems[0].getText().strip('$').replace(',',''))
    # driver.back()

    # I would do it this way:
    # When using explicit waits there is no need to explicitly sleep
    amount_str = waiter.find_element(driver, "CurrentBalanceAmount", by=ID).text
    # This conversion scheme also handles non-$ characters: keep only digits and the decimal point
    amount = float("".join(char for char in amount_str if char in "1234567890."))

    driver.back()

    return amount


def main():
    driver = webdriver.Chrome()
    try:
        driver.get(bank_url)
        bank_login(driver, 'username', 'password')
        print(sum([get_amount(driver, source) for source in ['Savings', 'cheque']]))

    finally:
        driver.quit()  # Use this try/finally idiom to prevent a bunch of dead browser instances


if __name__ == "__main__":
    main()

Full disclosure: I maintain the explicit package. You could replace the waiter calls above with relatively short Wait calls if you would prefer. If you are using Selenium with any regularity it is worth investing the time to understand and use explicit waits.

@LeviNoeker This is essentially my first python code after learning from a book, so the tools I know are limited. Could you expand a bit more on your points? What does `if __name__ == "__main__":` do and what is an importable module? What are explicit waits and how are they different from sleep? — fidr, Jan 02 '17 at 17:35
@fidr Added some additional info to the bullets. Hope it helps :-) — Levi Noecker, Jan 02 '17 at 20:25
