Ruby Script unable to gather data

Question

#!/usr/bin/ruby
# Fetches all Virginia Tech classes from the timetable and spits them out into a nice JSON object
# Can be run with option of which file to save output to or will save to classes.json by default
require 'rubygems'
require 'mechanize'
require 'nokogiri'
require 'json'

#Create Mechanize Browser and Class Data hash to load data into
agent = Mechanize.new
classData = Hash.new

#Get Subjects from Timetable page
page = agent.get("https://banweb.banner.vt.edu/ssb/prod/HZSKVTSC.P_ProcRequest")
subjects = page.forms.first.field_with(:name => 'subj_code').options

#Loop subjects
subjects.each do |subject|

#Get the Timetable Request page & Form
timetableSearch = agent.get("https://banweb.banner.vt.edu/ssb/prod/HZSKVTSC.P_ProcRequest")
searchDetails = timetableSearch.forms.first

#Submit with specific subject 
searchDetails.set_fields({
    :SUBJ_CODE => subject,
    :TERMYEAR => '201401',
    :CAMPUS => 0
})

#Submit the form and store results into courseListings
courseListings = Nokogiri::HTML(
    searchDetails.submit(searchDetails.buttons[0]).body
)

#Create Array in Hash to store all classes for subjects
classData[subject] = [] 

#For every Class
courseListings.css('table.dataentrytable/tr').collect do |course|

    subjectClassesDetails = Hash.new

    #Map Table Cells for each course to appropriate values
    [
        [ :crn, 'td[1]/p/a/b/text()'],
        [ :course, 'td[2]/font/text()'],
        [ :title, 'td[3]/text()'],
        [ :type, 'td[4]/p/text()'],
        [ :hrs, 'td[5]/p/text()'],
        [ :seats, 'td[6]/text()'],
        [ :instructor, 'td[7]/text()'],
        [ :days, 'td[8]/text()'],
        [ :begin, 'td[9]/text()'],
        [ :end, 'td[10]/text()'],
        [ :location, 'td[11]/text()'],
    #   [ :exam, 'td[12]/text()']
    ].collect do |name, xpath|
        #Not an additional time session (2nd row)
        if (course.at_xpath('td[1]/p/a/b/text()').to_s.strip.length > 2)
            subjectClassesDetails[name] = course.at_xpath(xpath).to_s.strip
        end
    end

    #Add class to Array for Subject!
    classData[subject].push(subjectClassesDetails)
end
end

#Write Data to JSON file
open(ARGV[0] || "classes.json", 'w') do |file| 
file.print JSON.pretty_generate(classData)
end

The above code is supposed to retrieve data from https://banweb.banner.vt.edu/ssb/prod/HZSKVTSC.P_ProcRequest, but if I print subjects.length it prints 0, so it clearly isn't getting the correct data. The given term code "201401" is definitely the right one.

I've noticed that when I open the link in my browser, the subject field doesn't let me select an option until a term is selected. However, when I view the page source, the data is clearly already there. What can I do to retrieve this data?

score 0 · Answer 1 · edited May 23 '17 at 12:03

Looking at that Virginia Tech page, I can see that you need to select a TERMYEAR before the subj_code dropdown fills in and lets you read its options. Unfortunately, this is done with JavaScript, in the function dropdownlist(listindex). Mechanize doesn't execute JavaScript, so this script is doomed to fail as written.

One option is to run a browser automation tool such as Watir or Selenium, as discussed in "How do I use Mechanize to process JavaScript?"

The other option is to read the source of that page and parse the values out of lines like these:

document.ttform.subj_code.options[0]=new Option("All Subjects","%",false, false);
document.ttform.subj_code.options[1]=new Option("AAEC - Agricultural and Applied Economics","AAEC",false, false);
document.ttform.subj_code.options[2]=new Option("ACIS - Accounting and Information Systems","ACIS",false, false);

You could fetch the page source simply by using open-uri:

require 'open-uri'
# URI.open is required on Ruby 3.0+, where Kernel#open no longer accepts URLs
page = URI.open("https://banweb.banner.vt.edu/ssb/prod/HZSKVTSC.P_ProcRequest")
page_source = page.read

Now you can use a regex to scan for all the options:

page_source.scan(/document\.ttform.+;/)

That gives you an array of all the lines of JavaScript that define the options. Tighten the regex a little and you can extract the option text from them. I'll see if I can come up with something for that and post back. Hopefully this gets you headed in the right direction.

I'm back. I was able to parse out all the subj_code options with this regex:

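# scan with a capture group returns nested arrays, so flatten before de-duplicating
subjects = page_source.scan(/Option\("(.*?)"/).flatten.uniq
subjects.shift # get rid of the first option because it's just "All Subjects"
subjects.size # => 137 when this answer was written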
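The regex above captures only the display text of each option. The form is submitted with the option's value (the second argument to `new Option`), so here is a rough sketch of capturing both. It assumes every option line has the same shape as the three sample lines shown earlier. The names `OPTION_PATTERN` and `extract_subject_options` are made up for this illustration, and the sample string stands in for the real `page_source`.

```ruby
# Capture 1 is the visible label, capture 2 is the value sent with the form.
# Assumes lines shaped like the samples quoted from the page source above.
OPTION_PATTERN = /subj_code\.options\[\d+\]\s*=\s*new Option\("([^"]*)","([^"]*)"/

# Hypothetical helper: returns a hash of subject code => label,
# skipping the "%" wildcard that stands for "All Subjects".
def extract_subject_options(page_source)
  page_source.scan(OPTION_PATTERN).each_with_object({}) do |(label, code), subjects|
    next if code == '%'
    subjects[code] ||= label # keep the first label if a code repeats
  end
end

# Stand-in for page_source, copied from the lines quoted earlier.
sample = <<~JS
  document.ttform.subj_code.options[0]=new Option("All Subjects","%",false, false);
  document.ttform.subj_code.options[1]=new Option("AAEC - Agricultural and Applied Economics","AAEC",false, false);
  document.ttform.subj_code.options[2]=new Option("ACIS - Accounting and Information Systems","ACIS",false, false);
JS

subjects_by_code = extract_subject_options(sample)
puts subjects_by_code.keys.inspect # prints ["AAEC", "ACIS"]
```

Keying the hash by code removes duplicates without a separate `uniq`, and the keys are what you would put in the subj_code field when submitting the search.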

Hope that helps.

Is there no way to select the term I want by changing the value of the form and then letting it reload itself? — Bryce Langlotz, Feb 11 '14 at 18:19
Nope. Looking at the page source, I can see that subj_code only gets populated via JavaScript. — DiegoSalazar, Feb 11 '14 at 18:20
And as mentioned, Mechanize doesn't handle javascript. You won't be able to get this script working with it. You'll have to switch to one of the aforementioned gems. — DiegoSalazar, Feb 11 '14 at 18:26
