I'm trying to scrape the text of the 'activity' drop-down from two pages (zm_ID 2500 and 2700 in the URL used in the code below).

I wrote the base of the code:

from selenium import webdriver
from selenium.webdriver.chrome.options import Options

options = Options()
options.binary_location = r'C:\Program Files (x86)\Google\Chrome\Application\chrome.exe'
options.add_experimental_option('excludeSwitches', ['enable-logging'])
#options.add_argument("--headless")
driver = webdriver.Chrome(options=options, executable_path='/mnt/c/Users/kela/Desktop/selenium/chromedriver.exe')

i = 2500  # or 2700
url = 'http://www.uwm.edu.pl/biochemia/biopep/peptide_data_page1.php?zm_ID=' + str(i)  # where i is either 2500 or 2700 in this example
driver.get(url)
header = driver.find_element_by_css_selector('[name="activity"]')
children = header.find_elements_by_xpath(".//*")

I have two issues:

  1. I only need the activity item whose option tag is marked 'selected' (option selected value=...); I don't want ALL the activities returned.
  2. BUT if the chosen option is the first item in the list, as on one of these two pages (the one whose activity is 'aami'), there is no 'selected' attribute to look for, because the first option is the default.

So I'm stuck on identifying a line or two of code that I could add to my script that would extract:

neuropeptide | ne
alpha-amylase inhibitor | aami

from these two web pages, if anyone could help.

Slowat_Kela

2 Answers

Check the attributes of the option elements. If any option has the 'selected' attribute, take that option. If no option has it, take the first option, since that is the default.

I've implemented the attribute check with BeautifulSoup. You can also implement it in Selenium by executing JavaScript code. Example here
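
The JavaScript route could look roughly like the sketch below. It is not the code behind the link, just one way to do it: the script reads the select element's selectedIndex, which for a single-choice drop-down with no explicit 'selected' attribute normally points at the first option. The names SELECTED_TEXT_JS and selected_text_via_js are made up for this illustration; execute_script is the standard Selenium WebDriver call, and the driver and element are passed in by the caller.

```python
# JavaScript run inside the page. arguments[0] is the <select> element
# handed over by execute_script. selectedIndex is -1 when nothing is selected.
SELECTED_TEXT_JS = """
const select = arguments[0];
const index = select.selectedIndex;
return index >= 0 ? select.options[index].text : null;
"""


def selected_text_via_js(driver, select_element):
    """Return the text of the currently selected option, or None.

    `driver` is any Selenium WebDriver and `select_element` the WebElement
    for the <select> (hypothetical helper, not part of Selenium itself).
    """
    text = driver.execute_script(SELECTED_TEXT_JS, select_element)
    return text.strip() if text is not None else None
```

Given a driver that has already loaded the page and the element found via the '[name="activity"]' selector, calling selected_text_via_js(driver, element) should return the same text the browser shows in the drop-down.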

My Code:

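from selenium import webdriver
from selenium.webdriver.common.by import By
from bs4 import BeautifulSoup

driver = webdriver.Firefox()
url = 'http://www.uwm.edu.pl/biochemia/biopep/peptide_data_page1.php?zm_ID=2500'

driver.get(url)

# find_element(By.CSS_SELECTOR, ...) works in both Selenium 3 and 4
header = driver.find_element(By.CSS_SELECTOR, '[name="activity"]')
soup = BeautifulSoup(header.get_attribute("innerHTML"), 'html.parser')

options = soup.find_all('option')
for option in options:
    # An explicitly chosen option carries the 'selected' attribute
    if 'selected' in option.attrs:
        print(option.text.strip())
        break
else:
    # for/else: reached only if the loop never hit break, i.e. no option
    # is marked 'selected', so the first option is the default
    print(options[0].text.strip())

driver.quit()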
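
The "selected, otherwise first" rule can also be tried without a browser. The sketch below uses only the standard library's html.parser; OptionCollector and pick_activity are hypothetical names, and the two HTML snippets are made-up stand-ins for the drop-down's innerHTML, not copies of the real pages.

```python
from html.parser import HTMLParser


class OptionCollector(HTMLParser):
    """Collect [text, has_selected_attribute] for every <option> tag."""

    def __init__(self):
        super().__init__()
        self.options = []
        self._in_option = False

    def handle_starttag(self, tag, attrs):
        if tag == "option":
            # 'selected' may appear bare or as selected="", both count
            is_selected = any(name == "selected" for name, _ in attrs)
            self.options.append(["", is_selected])
            self._in_option = True

    def handle_endtag(self, tag):
        if tag == "option":
            self._in_option = False

    def handle_data(self, data):
        if self._in_option:
            self.options[-1][0] += data


def pick_activity(select_inner_html):
    """Return the selected option's text, else the first option's text."""
    parser = OptionCollector()
    parser.feed(select_inner_html)
    parser.close()
    if not parser.options:
        return None
    for text, is_selected in parser.options:
        if is_selected:
            return text.strip()
    return parser.options[0][0].strip()


# Made-up sample markup for illustration only
WITH_SELECTED = (
    '<option value="aami">alpha-amylase inhibitor | aami</option>'
    '<option selected value="ne">neuropeptide | ne</option>'
)
WITHOUT_SELECTED = WITH_SELECTED.replace("selected ", "")

print(pick_activity(WITH_SELECTED))     # neuropeptide | ne
print(pick_activity(WITHOUT_SELECTED))  # alpha-amylase inhibitor | aami
```
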
Batuhan Gürses

Use the Select class and read its first_selected_option. Before that, wait for the element with WebDriverWait and presence_of_element_located.

i = 2700
url = 'http://www.uwm.edu.pl/biochemia/biopep/peptide_data_page1.php?zm_ID=' + str(i)  # where i is either 2500 or 2700 in this example
driver.get(url)
element = WebDriverWait(driver, 20).until(EC.presence_of_element_located((By.NAME, "activity")))
select = Select(element)
print(select.first_selected_option.text)

Output:

neuropeptide    |    ne

If you change the value to 2500, you will get alpha-amylase inhibitor | aami.

Import the following to run the code above.

from selenium.webdriver.support.select import Select
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium import webdriver
KunduK