
I need to extract the phone number and website link, along with the name and country, of each university from a website. The website is https://www.whed.net/results_institutions.php?Chp2=Business%20Administration and the problem is that there is a + sign which needs to be clicked for every university; the data then needs to be extracted, the pop-up closed, and the process moved on to the next university.

I have tried multiple approaches with Selenium, as follows:

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import TimeoutException
from selenium.webdriver.common.keys import Keys
import time
from bs4 import BeautifulSoup
import pandas as pd

#opening the web browser
browser = webdriver.Chrome('C:\\Users\\albert.malhotra\\Desktop\\Web Scrapings\\Kentucky State\\chromedriver')

#assigning the link to a variable
url = 'https://www.whed.net/results_institutions.php?Chp2=Business%20Administration'

#opening the url in the browser
browser.get(url)
dfs = []
dfss = []
for n in range(50):
    html = browser.page_source
    soup = BeautifulSoup(html, 'lxml')

    for data in soup.find_all('p' , {'class' : 'country'}):
        item = data.text

        for thead in soup.find_all('div', {'class' : 'details'}):
            #data_2 = thead.find_all('a')
            data_2 = thead.select('h3')

            browser.find_element_by_link_text('More details').click()
            html_2 = browser.page_source
            soup_1 = BeautifulSoup(html_2, 'lxml')
            name = []
            for phone in soup_1.find_all('span' , {'class' : 'contenu'}):
                data_3 = phone.text
                name.append(data_3)
            browser.find_element_by_css_selector(".fancybox-item.fancybox-close").click()
            dfss.append(data_2[0].text)
            dfs.append(item)

3 Answers


If you observe the page carefully, the + symbol opens a URL in a pop-up. So in this case, instead of clicking the + button and then traversing the pop-up, it is easier to open the pop-up's URL directly and then traverse that page. Here is the code to do it.

from selenium import webdriver

siteURL = "https://www.whed.net/results_institutions.php?Chp2=Business%20Administration"
browser = webdriver.Chrome(executable_path='chromedriver.exe')
browser.get(siteURL)

#this returns all the pop-up links as a list of elements
search = browser.find_elements_by_class_name('fancybox')
#for test purposes only the first link is used
print(search[0].get_attribute("href"))
#this opens the page behind the first pop-up; just parse the source code and extract your data
browser.get(search[0].get_attribute("href"))
#you can run a loop to traverse the complete list of URLs

To get the number of URLs you can use len() on the list.
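For example, a minimal sketch of that loop, collecting the hrefs up front because the element references go stale once the browser navigates away:

from selenium import webdriver
from bs4 import BeautifulSoup

browser = webdriver.Chrome(executable_path='chromedriver.exe')
browser.get("https://www.whed.net/results_institutions.php?Chp2=Business%20Administration")

#collect the href attributes first; the WebElements become stale after navigation
hrefs = [el.get_attribute("href") for el in browser.find_elements_by_class_name('fancybox')]
print(len(hrefs)) #the total number of pop-up URLs found

for href in hrefs:
    browser.get(href)
    soup = BeautifulSoup(browser.page_source, 'lxml')
    #parse name, country, phone and website out of soup here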

Ram
  • If I use the get_attribute() feature in a loop it does allow me to scrape the links and visit the pages; however, the majority of links it outputs do not need to be scraped, and this happens for all the web pages. So if I only want to get, let's say, the first 50 or 100 links, should I be using the length property of the array? I am not exactly sure how to do this. – Albert Mar 07 '19 at 19:14
  • You can get the length of the list using `trs = browser.find_elements_by_class_name("type-fancybox")` and `length = len(trs)`. This gives the total number of fancybox elements. When you run the loop, restrict it to the number you need: `for x in range(length):` traverses every element, where 'x' is the index of each element, so to visit fewer than the total, loop over `range(number_you_want)` instead; see the sketch below. – Ram Mar 07 '19 at 19:29
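A short sketch of the limiting approach that comment describes; slicing the list is an equivalent way to cap the loop at, say, the first 50 links, assuming the hrefs were collected into a list first as in the example above:

hrefs = [el.get_attribute("href") for el in browser.find_elements_by_class_name('fancybox')]
for href in hrefs[:50]: #visit only the first 50 pop-up URLs
    browser.get(href)
    #parse the page as before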

You don't necessarily need Selenium. You can certainly use requests for a large result set. The page retrieves data from the server, which runs a SQL query with a record-count parameter, nbr_ref_pge, that you can adjust to the number of results you want. You can write a POST request that passes the necessary information, which is later fed to the SQL query. From there you can work out how to request the results in batches to reach the total number you need, and check whether there is an offset parameter to allow for this.

I am not experienced enough with asyncio, but I suspect it would be a good way to go, as the request count to individual site pages is high. My attempt with just Session is shown below. I took the retry syntax from an answer by @datashaman.

import requests
import pandas as pd
from bs4 import BeautifulSoup as bs
from requests.packages.urllib3.util.retry import Retry
from requests.adapters import HTTPAdapter

baseUrl = 'https://www.whed.net/'
searchTerm = 'Business Administration'
headers = {'Accept': 'application/json'}
params = {'Chp2' : searchTerm}
url = 'https://www.whed.net/results_institutions.php'
data = {
    'where': "(FOS LIKE '%|" + searchTerm + "|%')",
    'requete' : '(Fields of study=' + searchTerm + ')',
    'ret' : 'home.php',
    'afftri' : 'yes',
    'stat' : 'Fields of study',
    'sort' : 'InstNameEnglish,iBranchName',
    'nbr_ref_pge' : '1000'
}

results = []

with requests.Session() as s:
    retries = Retry(total=5,
                backoff_factor=0.1,
                status_forcelist=[ 500, 502, 503, 504 ])

    s.mount('http://', HTTPAdapter(max_retries=retries))
    res = s.post(url, params = params, headers = headers, data = data)
    soup = bs(res.content, 'lxml')
    links = set([baseUrl + item['href'] for item in soup.select("[href*='detail_institution.php?']")])

    for link in links:
        res = s.get(link)
        soup = bs(res.content, 'lxml')
        items = soup.select('#contenu span')
        name = soup.select_one('#contenu h2').text.strip()
        country = soup.select_one('.country').text.strip()
        phone = website = '' #defaults in case a label is missing on a page
        for i, item in enumerate(items):
            if 'Tel.' in item.text:
                phone = items[i+1].text #the value span follows its label span
            if 'WWW:' in item.text:
                website = items[i+1].text
        results.append([name, country, phone, website])
df = pd.DataFrame(results)
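As a rough illustration of the asyncio idea (an assumption on my part, untested against this site), the per-link GET requests could be issued concurrently with aiohttp, reusing the links set collected above:

import asyncio
import aiohttp
from bs4 import BeautifulSoup as bs

async def fetch_detail(session, link):
    #fetch one detail page and return its parsed soup
    async with session.get(link) as res:
        html = await res.text()
    return bs(html, 'lxml')

async def fetch_all(links):
    async with aiohttp.ClientSession() as session:
        return await asyncio.gather(*(fetch_detail(session, link) for link in links))

#soups = asyncio.run(fetch_all(links)) #then parse each soup exactly as in the loop above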
QHarr

To extract the website links of the universities you won't need BeautifulSoup; Selenium alone can easily extract the required data using the solution below:

  • Code Block:

    from selenium import webdriver
    from selenium.webdriver.support.ui import WebDriverWait
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    
    options = webdriver.ChromeOptions()
    options.add_argument('start-maximized')
    options.add_argument('disable-infobars')
    options.add_argument('--disable-extensions')
    driver = webdriver.Chrome(chrome_options=options, executable_path=r'C:\WebDrivers\chromedriver.exe')
    driver.get('https://www.whed.net/results_institutions.php?Chp2=Business%20Administration')
    elements = WebDriverWait(driver, 30).until(EC.visibility_of_all_elements_located((By.CSS_SELECTOR, "a.detail.fancybox[title='More details']")))
    for element in elements:
        #open the pop-up for this university
        WebDriverWait(driver, 30).until(EC.visibility_of(element)).click()
        #the pop-up content is rendered inside an iframe, so switch into it
        WebDriverWait(driver, 10).until(EC.frame_to_be_available_and_switch_to_it((By.CSS_SELECTOR,"iframe.fancybox-iframe")))
        print(WebDriverWait(driver, 5).until(EC.visibility_of_element_located((By.CSS_SELECTOR, "a.lien"))).get_attribute("innerHTML"))
        #switch back to the main document and close the pop-up
        driver.switch_to.default_content()
        driver.find_element_by_css_selector("a.fancybox-item.fancybox-close").click()
    driver.quit()
    
  • Console Output:

    http://www.uni-ruse.bg
    http://www.vspu.hr
    http://www.vfu.bg
    http://www.uni-svishtov.bg
    http://www.universitateagbaritiu.ro
    http://www.shu-bg.net
    http://universityecotesbenin.com
    http://www.vps-libertas.hr
    http://www.swu.bg
    http://www.zrinski.org/nikola
    

Note: The remaining items, phone, name and country, can now easily be extracted in the same way.
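For instance, a hedged sketch of pulling the name and phone while still switched into the pop-up iframe, assuming the pop-up shares the #contenu label/value layout used by the detail pages in the requests-based answer above:

    #run this inside the loop above, after switching into the fancybox iframe
    name = driver.find_element_by_css_selector("#contenu h2").text.strip()
    spans = driver.find_elements_by_css_selector("#contenu span")
    phone = ''
    for idx, span in enumerate(spans):
        if 'Tel.' in span.text:
            phone = spans[idx + 1].text #the value span follows its 'Tel.' label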

undetected Selenium