So for a project, I'm working on creating an API to interface with my school's course finder, and I'm struggling to grab the data from the HTML table it's stored in without using Selenium. I was able to pull the HTML data initially using Selenium, but my instructor says he would prefer I use the BeautifulSoup4 and MechanicalSoup libraries. I got as far as submitting a search and grabbing the HTML table the data is stored in, but I'm not sure how to iterate through that data the way I did with my Selenium code below.

from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.support.ui import Select
from selenium.webdriver.chrome.options import Options

Chrome_Options = Options()
Chrome_Options.add_argument("--headless") #allows program to run without opening a chrome window

driver = webdriver.Chrome(options=Chrome_Options) #passes the headless option to the driver
driver.get("https://winnet.wartburg.edu/coursefinder/") #points the Selenium driver at the course finder

select = Select(driver.find_element_by_id("ctl00_ContentPlaceHolder1_FormView1_DropDownList_Term"))
term_options = select.options
#for index in range(0, len(term_options) - 1):
#    select.select_by_index(index)


lst = []

DeptSelect = Select(driver.find_element_by_id("ctl00_ContentPlaceHolder1_FormView1_DropDownList_Department")) 
DeptSelect.select_by_visible_text("History") #finds the desired department

search = driver.find_element_by_name("ctl00$ContentPlaceHolder1$FormView1$Button_FindNow")
search.click() #sends query

table_id = driver.find_element_by_id("ctl00_ContentPlaceHolder1_GridView1")
rows = table_id.find_elements_by_tag_name("tr")
for row in rows: #builds a flat list of every cell's text
    col = row.find_elements_by_tag_name("td")
    for data in col:
        lst.append(data.text)

def chunk(l, n): #generator that partitions our list neatly
    print("chunking...")
    for i in range(0, len(l), n):
        yield l[i:i + n]

n = 16 #each list contains 16 items regardless of contents or search
uberlist = list(chunk(lst, n)) #call chunk fn to partition the list

with open('class_data.txt', 'w') as handler: #output of scraped data
    print("writing file...")
    for listitem in uberlist:
        handler.write('%s\n' % listitem)

driver.quit() #ends and closes Selenium's control over the browser

This is my Soup code, and I'm wondering how I can pull the data from the HTML table in a similar way to what I did above with Selenium.

import mechanicalsoup
import requests
from lxml import html
from lxml import etree
import pandas as pd

def text(elt):
    return elt.text_content().replace(u'\xa0', u' ')

#This will use MechanicalSoup to grab the form, submit it, and find the data table
browser = mechanicalsoup.StatefulBrowser()
winnet = "http://winnet.wartburg.edu/coursefinder/"
browser.open(winnet)
Searchform = browser.select_form()
Searchform.choose_submit('ctl00$ContentPlaceHolder1$FormView1$Button_FindNow')
response1 = browser.submit_selected() #This Progresses to Second Form
dataURL = browser.get_url() #Get URL of Second Form w/ Data
dataURL2 = 'https://winnet.wartburg.edu/coursefinder/Results.aspx'

pageContent = requests.get(dataURL2)
tree = html.fromstring(pageContent.content)
dataTable = tree.xpath('//*[@id="ctl00_ContentPlaceHolder1_GridView1"]')
rows = [] #initialize a collection of rows
for row in dataTable[0].xpath(".//tr")[1:]: #add new rows to the collection
    rows.append([cell.text_content().strip() for cell in row.xpath(".//td")])

df = pd.DataFrame(rows) #load the collection to a dataframe
print(df)
#XPath to Table
#//*[@id="ctl00_ContentPlaceHolder1_GridView1"]
#//*[@id="ctl00_ContentPlaceHolder1_GridView1"]/tbody
  • Use XPaths for scraping the data inside the table, then iterate over the whole thing with a for loop. This will solve your problem; I was having the same problem and figured it out this way (a minimal sketch of this pattern follows these comments). –  Jan 26 '20 at 11:18
  • Do you have any tutorials or examples to show me? I've never worked with XPath before, so I'm at a loss as to how to properly use it to grab the table in the HTML and then parse it. – Rob Jan 26 '20 at 23:55
  • Oh, it's easy. You scrape elements on the page by inspecting them and getting the `id` or `class`. But with tables it's not possible to get those classes and ids, so just click on the table and inspect it with your cursor; you will get something like this `xxxx'`. Then right-click on that piece of code and there will be many options available: click Copy, then Copy XPath. –  Jan 27 '20 at 02:20
  • You can also check this link: https://stackoverflow.com/questions/3030487/is-there-a-way-to-get-the-xpath-in-google-chrome . If you're still having problems, ping me and I will give you the code in an answer. @Robert Farmer –  Jan 27 '20 at 02:22
  • I've updated my MechanicalSoup code and I'm getting an empty DataFrame... I'm not sure if my code is in error, or if, since the page gets populated by the results of the first page's form submission, it doesn't keep the submitted results. – Rob Jan 27 '20 at 17:07
  • Why don't you use Beautiful Soup instead? Is it compulsory for you to use MechanicalSoup, or could you use Beautiful Soup as well? If you want to use Beautiful Soup, I can give you the code, because I don't know MechanicalSoup... –  Jan 28 '20 at 02:27
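
A minimal sketch of the XPath-and-for-loop pattern the first comment suggests, assuming the results page has been saved locally as results.html (a hypothetical filename); the GridView id is taken from the question's own code:

import lxml.html

tree = lxml.html.parse("results.html") #parse a saved copy of the results page
table = tree.xpath('//*[@id="ctl00_ContentPlaceHolder1_GridView1"]')[0] #locate the results table by its id

for row in table.xpath(".//tr")[1:]: #skip the header row
    cells = [cell.text_content().strip() for cell in row.xpath(".//td")]
    print(cells) #each class row as a list of cell strings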

1 Answer

Turns out I was passing the wrong thing when using MechanicalSoup. I passed the new page's contents to a variable called table and had it use .find('table') to retrieve just the table's HTML rather than the full page's HTML. From there I just used table.get_text().split('\n') to make essentially a giant list of all of the rows.

I also dabbled with setting form filters, which worked as well.

import mechanicalsoup
from bs4 import BeautifulSoup

#Sets the StatefulBrowser object to winnet, then grabs the form
browser = mechanicalsoup.StatefulBrowser()
winnet = "http://winnet.wartburg.edu/coursefinder/"
browser.open(winnet)
Searchform = browser.select_form()

#Selects submit button and has filter options listed.

Searchform.choose_submit('ctl00$ContentPlaceHolder1$FormView1$Button_FindNow')
Searchform.set('ctl00$ContentPlaceHolder1$FormView1$TextBox_keyword', "") #Keyword searches by class title; inputting a string will search by that string, ignoring anything stored in the page.
#ACxxx course codes have 3 spaces after them, THIS IS REQUIRED. The 'All' value for not searching by a department is the exception.
Searchform.set("ctl00$ContentPlaceHolder1$FormView1$DropDownList_Department", 'All') #The department list takes the course codes as inputs and displays them as the full department name.
Searchform.set("ctl00$ContentPlaceHolder1$FormView1$DropDownList_Term", "2020 Winter Term") #The term dropdown takes a string that is exactly the term date.
Searchform.set('ctl00$ContentPlaceHolder1$FormView1$DropDownList_MeetingTime', 'all') #Takes the weekly class time as a string; need to retrieve the list of options from the page.
Searchform.set('ctl00$ContentPlaceHolder1$FormView1$DropDownList_EssentialEd', 'none') #Takes a short string signaling the EE requirement, or 'all' or 'none'; 'none' doesn't select an option and 'all' selects all courses with an EE.
Searchform.set('ctl00$ContentPlaceHolder1$FormView1$DropDownList_CulturalDiversity', 'none') #Cultural diversity; takes 'none', 'C', 'D', or 'all'.
Searchform.set('ctl00$ContentPlaceHolder1$FormView1$DropDownList_WritingIntensive', 'none') #Options are 'none' or 'WI'.
Searchform.set('ctl00$ContentPlaceHolder1$FormView1$DropDownList_PassFail', 'none') #Pass/Fail takes 'none' or 'PF'.
Searchform.set('ctl00$ContentPlaceHolder1$FormView1$CheckBox_OpenCourses', False) #Checkbox; it's True or False.
Searchform.set('ctl00$ContentPlaceHolder1$FormView1$DropDownList_Instructor', '0') #'0' is for none selected; otherwise it is a string of numbers (instructor ID?).

#Submits the form, grabs the results, and pulls the table out of the returned page.
browser.submit_selected() #Submits the form and retrieves the results.
table = browser.get_current_page().find('table') #Finds the results table.
print(type(table))
rows = table.get_text().split('\n') #List of all class rows, split by \n.
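
As a follow-up, the flat text list can be regrouped into per-class records, much like the Selenium version chunked its list. Here is a minimal sketch reusing the table variable from above; find_all('tr') and find_all('td') mirror the Selenium row/cell iteration and avoid the blank entries that get_text().split('\n') produces:

for tr in table.find_all('tr')[1:]: #skip the header row
    cells = [td.get_text(strip=True) for td in tr.find_all('td')]
    if cells: #ignore rows without data cells
        print(cells) #one list of cell text per class row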