
My code successfully scrapes the `tr align=center` tags from http://my.gwu.edu/mod/pws/courses.cfm?campId=1&termId=201501&subjId=ACCY and writes the `td` elements to a text file.

However, there are multiple pages available at the site above that I would like to scrape.

For example, with the URL above, when I click the link to "page 2" the overall URL does NOT change. I looked at the page source and saw JavaScript code that advances to the next page.

How can my code be changed to scrape data from all the available listed pages?

My code that works for page 1 only:

import bs4
import requests 

response = requests.get('http://my.gwu.edu/mod/pws/courses.cfm?campId=1&termId=201501&subjId=ACCY')

soup = bs4.BeautifulSoup(response.text)
soup.prettify()

acct = open("/Users/it/Desktop/accounting.txt", "w")

for tr in soup.find_all('tr', align='center'):
    stack = []
    for td in tr.findAll('td'):
        stack.append(td.text.replace('\n', '').replace('\t', '').strip())

    acct.write(", ".join(stack) + '\n')
Philip McQuitty
  • It is not really possible with requests or any other plain HTML-fetching tool; if you want to do that you have to go with a web driver like Selenium, but it is way more complicated than requests. Good luck – brunsgaard Oct 21 '14 at 23:03
  • It's just simple URL manipulation, really. Just check the `POST` requests using Google Chrome's inspection tool or Firebug for Firefox. See my answer below. – WGS Oct 21 '14 at 23:05
  • @Nanashi, you should maybe explain how to do what you suggest in your answer – Padraic Cunningham Oct 21 '14 at 23:06
  • Will do, mate. Just adding code as well. :) – WGS Oct 21 '14 at 23:08
  • Guys, btw, thank you both for keeping web-scraping tag in shape! :) – alecxe Oct 21 '14 at 23:09
  • Hello, @alecxe. You've always soldiered on with this stuff; time to practice my chops again in this tag! – WGS Oct 21 '14 at 23:18

1 Answer


The trick here is to inspect the requests that are made when you click the link to view another page. The way to check this is to use Chrome's inspection tool (press F12) or to install the Firebug extension in Firefox. I will be using Chrome's inspection tool in this answer. See below for my settings.

[Screenshot: Chrome DevTools Network panel settings]

Now, what we want to see is either a GET request to another page or a POST request that changes the page. While the tool is open, click on a page number. For a very brief moment, only one request appears, and it is a POST. All the other requests quickly follow and fill the panel. See below for what we're looking for.

[Screenshot: the lone POST request appearing in the Network panel]

Click on that POST request. It should bring up a sub-window with several tabs. Click on the Headers tab. This tab lists the request headers, essentially the identifying information that the other side (the site) needs from you in order to serve the request.

Whenever the URL has variables like page numbers, location markers, or categories, the site more often than not uses query strings. Long story short, a query string is the set of key-value parameters tacked onto the end of the URL after the ?, which the site uses to look up the information you need. If this is the case, you can check the request headers for query string parameters. Scroll down a bit and you should find them.
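To make the idea concrete, here is a minimal sketch (Python 3 standard library; the answer code further below is Python 2) that splits this URL's query string into its key-value parameters:

from urllib.parse import urlparse, parse_qs

# Break the course-listing URL's query string into its parameters.
url = 'http://my.gwu.edu/mod/pws/courses.cfm?campId=1&termId=201501&subjId=ACCY'
params = parse_qs(urlparse(url).query)
print(params)  # {'campId': ['1'], 'termId': ['201501'], 'subjId': ['ACCY']}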

[Screenshot: query string parameters and Form Data in the Headers tab]

As you can see, the query string parameters match the variables in our URL. A little bit below, you can see Form Data with pageNum: 2 beneath it. This is the key.

POST requests are often called form requests because they are the kind of requests made when you submit forms, log in to websites, and so on: basically, anything where you have to submit information. What most people don't notice is that POST requests also have a URL they are sent to. A good example of this is when you log in to a website and, very briefly, see your address bar morph into some sort of gibberish URL before settling on /index.html or the like.

What the above paragraph means is that you can often (though not always) append the form data to your URL as query string parameters, and the request will return the same page the POST would. To know the exact string you have to append, click on view source.

[Screenshot: the form data shown via view source]

Test if it works by adding it to the URL.
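If you would rather test this from Python than from the address bar, here is a minimal sketch (Python 3) of the two ways to ask for page 2 with requests: appending pageNum to the query string, or sending it as form data in a POST, as the browser does. Whether the server still accepts both forms is an assumption based on the behaviour described above.

import requests

base_url = 'http://my.gwu.edu/mod/pws/courses.cfm?campId=1&termId=201501&subjId=ACCY'

# Option 1: append the form data to the URL as an extra query string parameter.
page_2_get = requests.get(base_url + '&pageNum=2')

# Option 2: send pageNum as form data in a POST request, as the browser does.
page_2_post = requests.post(base_url, data={'pageNum': 2})

print(page_2_get.status_code, page_2_post.status_code)  # expect 200, 200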

[Screenshot: the page loaded with &pageNum=2 appended to the URL]

Et voila, it works. Now, the real challenge: getting the last page automatically and scraping all of the pages. Your code is pretty much there. The only things remaining to be done are getting the number of pages, constructing a list of URLs to scrape, and iterating over them.

Modified code is below:

from bs4 import BeautifulSoup as bsoup
import requests as rq
import re

base_url = 'http://my.gwu.edu/mod/pws/courses.cfm?campId=1&termId=201501&subjId=ACCY'
r = rq.get(base_url)

soup = bsoup(r.text)
# Use a regex to isolate only the page-number links (the ones you click on).
page_count_links = soup.find_all("a",href=re.compile(r".*javascript:goToPage.*"))
try: # If there is more than one page link, use the last one; otherwise default to 1.
    num_pages = int(page_count_links[-1].get_text())
except IndexError:
    num_pages = 1

# Add 1 because Python range.
url_list = ["{}&pageNum={}".format(base_url, str(page)) for page in range(1, num_pages + 1)]

# Open the text file. Use with to save self from grief.
with open("results.txt","wb") as acct:
    for url_ in url_list:
        print "Processing {}...".format(url_)
        r_new = rq.get(url_)
        soup_new = bsoup(r_new.text)
        for tr in soup_new.find_all('tr', align='center'):
            stack = []
            for td in tr.findAll('td'):
                stack.append(td.text.replace('\n', '').replace('\t', '').strip())
            acct.write(", ".join(stack) + '\n')

We use a regular expression to get the proper links. Then, using a list comprehension, we build a list of URL strings. Finally, we iterate over them.
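To see what the regex search does on its own, here is a minimal standalone sketch (Python 3) run against a hypothetical snippet of pager markup in the style the page appears to use; the live page's HTML may differ:

from bs4 import BeautifulSoup
import re

# Hypothetical pager markup, for illustration only.
html = '''
<a href="javascript:goToPage(1)">1</a>
<a href="javascript:goToPage(2)">2</a>
<a href="javascript:goToPage(3)">3</a>
'''
soup = BeautifulSoup(html, "html.parser")
page_links = soup.find_all("a", href=re.compile(r".*javascript:goToPage.*"))
print(int(page_links[-1].get_text()))  # 3, i.e. the last page number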

Results:

Processing http://my.gwu.edu/mod/pws/courses.cfm?campId=1&termId=201501&subjId=ACCY&pageNum=1...
Processing http://my.gwu.edu/mod/pws/courses.cfm?campId=1&termId=201501&subjId=ACCY&pageNum=2...
Processing http://my.gwu.edu/mod/pws/courses.cfm?campId=1&termId=201501&subjId=ACCY&pageNum=3...
[Finished in 6.8s]

[Screenshot: contents of results.txt]

Hope that helps.

EDIT:

Out of sheer boredom, I went ahead and created a scraper for the entire class directory. I have also updated both the code above and the code below so they do not error out when only a single page is available.

from bs4 import BeautifulSoup as bsoup
import requests as rq
import re

spring_2015 = "http://my.gwu.edu/mod/pws/subjects.cfm?campId=1&termId=201501"
r = rq.get(spring_2015)
soup = bsoup(r.text)
classes_url_list = [c["href"] for c in soup.find_all("a", href=re.compile(r".*courses.cfm\?campId=1&termId=201501&subjId=.*"))]
print classes_url_list

with open("results.txt","wb") as acct:
    for class_url in classes_url_list:
        base_url = "http://my.gwu.edu/mod/pws/{}".format(class_url)
        r = rq.get(base_url)

        soup = bsoup(r.text)
        # Use a regex to isolate only the page-number links (the ones you click on).
        page_count_links = soup.find_all("a",href=re.compile(r".*javascript:goToPage.*"))
        try:
            num_pages = int(page_count_links[-1].get_text())
        except IndexError:
            num_pages = 1

        # Add 1 because Python range.
        url_list = ["{}&pageNum={}".format(base_url, str(page)) for page in range(1, num_pages + 1)]

        # Scrape every page of this subject's course listing.
        for url_ in url_list:
            print "Processing {}...".format(url_)
            r_new = rq.get(url_)
            soup_new = bsoup(r_new.text)
            for tr in soup_new.find_all('tr', align='center'):
                stack = []
                for td in tr.findAll('td'):
                    stack.append(td.text.replace('\n', '').replace('\t', '').strip())
                acct.write(", ".join(stack) + '\n')
WGS
  • What could I do to determine the length or amount of available pages? – Philip McQuitty Oct 21 '14 at 23:12
  • @PhilipMcQuitty: There you go. I think that pretty much covers everything about this scrape. – WGS Oct 21 '14 at 23:50
  • 5
    you went above and beyond what I was hoping to get out of this question. Stackoverflow needs more users like you, this is a huge huge help. – Philip McQuitty Oct 22 '14 at 00:04
  • Glad to help. Absolutely love scraping so I try to help out as much as I can in these tags. Enjoy! – WGS Oct 22 '14 at 00:10
  • Let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/63447/discussion-between-philip-mcquitty-and-nanashi). – Philip McQuitty Oct 22 '14 at 00:11
  • @LaughingMan I have a very similar situation to the OP, but there are a few exceptions. I was wondering if you could help me out, I'd greatly appreciate it: http://stackoverflow.com/questions/32940355/how-to-scrape-multiple-pages-when-javascriptvoid – TheRealFakeNews Oct 05 '15 at 02:37
  • @NullDev it is not working with https://www.sustainalytics.com/esg-ratings – Kartik Punjabi Sep 28 '21 at 10:10