I'm using Selenium with Python 2.7 to retrieve the contents of a search box on a webpage. The search box dynamically retrieves the results and displays them in the box itself.
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
import pandas as pd
import re
from time import sleep

driver = webdriver.Firefox()
driver.get(url)
df = pd.read_csv("read.csv")

def crawl(isin):
    searchkey = driver.find_element_by_name("searchkey")
    searchkey.clear()
    searchkey.send_keys(isin)
    sleep(11)
    search_result = driver.find_element_by_class_name("ac_results")
    names = re.match(r"^.*(?=(\())", search_result.text).group().encode("utf-8")
    product_id = re.findall(r"((?<=\()[0-9]*)", search_result.text)
    return pd.Series([product_id, names])

df[["insref", "name"]] = df["ISIN"].apply(crawl)
print df
The relevant part of the code is in def crawl(isin):
- The program enters the search term in the search box (the box is badly named searchkey).
- It then calls sleep() and waits for the content to show in the search box's dropdown field, ac_results.
- It then extracts two variables, insrefs (product_id in the code) and names, with regex.
Instead of calling sleep(), I would like it to wait for the content in the WebElement ac_results to load.
Since the script repeatedly uses the search box, entering new search terms from a list, one could perhaps use regex to detect when ac_results contains new content that is not identical to the previous content.
Is there a method for this? It is important to note that the content in the search box is dynamically loaded, so the function would have to recognise that something has changed in the WebElement.
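To illustrate what I have in mind, here is a minimal sketch of the "wait until the content differs from the previous content" idea. It is plain Python polling and does not use any Selenium API; the function name wait_for_new_text and its parameters are hypothetical names I made up for this example, and the timeout and polling interval are arbitrary. With Selenium, get_text would presumably be something like lambda: driver.find_element_by_class_name("ac_results").text.

```python
import time


def wait_for_new_text(get_text, previous_text, timeout=10.0, poll_interval=0.25):
    """Poll get_text() until it returns non-empty text that differs from previous_text.

    get_text is any zero-argument callable (hypothetical helper, not a Selenium API).
    Returns the new text, or raises RuntimeError if nothing changes before the timeout.
    """
    deadline = time.time() + timeout
    while True:
        try:
            text = get_text()
        except Exception:
            # Assumption: the element may not exist yet or may have been replaced,
            # so any lookup error is treated as "no content yet" and retried.
            text = ""
        if text and text != previous_text:
            return text
        if time.time() >= deadline:
            raise RuntimeError("content did not change within %s seconds" % timeout)
        time.sleep(poll_interval)
```

In crawl(), this would replace sleep(11): keep the text from the previous search in a variable and pass it as previous_text. One limitation of this approach is that two consecutive searches returning identical text would look like "no change" and run into the timeout.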