
I have been trying to write a script to scrape data from the website https://services.aamc.org/msar/home#null. I wrote a Python 2.7 scrapy script to get a piece of text from the website (I am aiming for anything at this point), but cannot seem to get it to work. I suspect this is because I have not configured my regex properly to identify the span tag I am trying to scrape. Does anyone have any idea what I might be doing wrong and how to fix it?

Much appreciated.

Matt


import urllib
import re

url = "https://services.aamc.org/msar/home#null"
htmlfile = urllib.urlopen(url)
htmltext = htmlfile.read()
regex = '<td colspan="2" class="schoolLocation">(.+?)</td>'
pattern = re.compile(regex)
price = re.findall(pattern, htmltext)
print "the school location is ",price
mg520
  • Are you actually using Scrapy web-scraping framework? – alecxe May 02 '16 at 19:36
  • You can test your crawlers with the `scrapy shell`, additionally consider using `BeautifulSoup` instead. – Jan May 02 '16 at 19:41
  • @Jan I think the question is mistagged, looks like the OP is using neither Scrapy nor BeautifulSoup. – alecxe May 02 '16 at 19:51

1 Answer


First of all, don't use regular expressions to parse HTML. There are specialized tools called HTML parsers, like BeautifulSoup or lxml.html.
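For illustration, here is roughly what the HTML-parser approach looks like with BeautifulSoup. This is a minimal sketch that assumes the schoolLocation cells from your regex were actually present in the downloaded HTML (which, as explained below, they are not on this page):

import urllib

from bs4 import BeautifulSoup

url = "https://services.aamc.org/msar/home#null"
html = urllib.urlopen(url).read()  # Python 2, as in the question

soup = BeautifulSoup(html, "html.parser")

# find every <td class="schoolLocation"> instead of regexing the raw HTML
for cell in soup.find_all("td", class_="schoolLocation"):
    print(cell.get_text(strip=True))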

Actually, that advice is not all that relevant to this particular problem, since there is no need to parse HTML here. The search results on this page are loaded dynamically from a separate endpoint: the browser sends an XHR request, receives a JSON response, parses it, and renders the search results with JavaScript. urllib is not a browser; it only gives you the initial page HTML, which contains an empty search-results container.
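You can verify this with a quick check (a sketch, reusing the class name from the question). Since the result rows are filled in by JavaScript, the class should not appear in the raw HTML that urllib downloads:

import urllib

html = urllib.urlopen("https://services.aamc.org/msar/home#null").read()

# the result rows are injected by JavaScript, so this should print 0
print(html.count('class="schoolLocation"'))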

What you need to do is simulate that XHR request in your code. Let's use the requests package. Here is complete working code that prints the list of school programs:

import requests


url = "https://services.aamc.org/msar/home#null"
search_url = "https://services.aamc.org/msar/search/resultData"

with requests.Session() as session:
    session.get(url)  # visit main page

    # search
    data = {
        "start": "0",
        "limit": "40",
        "sort": "",
        "dir": "",
        "newSearch": "true",
        "msarYear": ""
    }
    response = session.post(search_url, data=data)

    # extract search results
    results = response.json()["searchResults"]["rows"]
    for result in results:
        print(result["schoolProgramName"])

Prints:

Albany Medical College
Albert Einstein College of Medicine
Baylor College of Medicine
...
Howard University College of Medicine
Howard University College of Medicine Joint Degree Program
Icahn School of Medicine at Mount Sinai
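
The request above only asks for the first 40 rows ("limit": "40"). Judging by their names, "start" and "limit" look like a paging offset and page size, so a follow-up sketch along these lines could walk through the whole list (untested; the parameter meanings are inferred, the rest of the form data is the same as above):

import requests


url = "https://services.aamc.org/msar/home#null"
search_url = "https://services.aamc.org/msar/search/resultData"

with requests.Session() as session:
    session.get(url)  # visit main page to pick up session cookies

    programs = []
    start, limit = 0, 40
    while True:
        # "start"/"limit" presumably mean offset/page size; other fields as above
        data = {
            "start": str(start),
            "limit": str(limit),
            "sort": "",
            "dir": "",
            "newSearch": "true",
            "msarYear": ""
        }
        response = session.post(search_url, data=data)
        rows = response.json()["searchResults"]["rows"]
        if not rows:  # stop once the endpoint runs out of results
            break

        programs.extend(row["schoolProgramName"] for row in rows)
        start += limit

    print("%d programs found" % len(programs))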
alecxe
  • This is excellent, thank you. I can see where you are looking in the source code for the data too, and I can see that you can change the print(result["xxxx"]) to be whatever variable you would like within the data. A couple of questions: how did you find the search_url? And what do the `data = {...}` fields control (specifically "sort", "dir", "newSearch", and "msarYear")? Much appreciated. Matt – mg520 May 03 '16 at 01:44
  • @MattGrossman sure, please see if the answer deserves to be accepted. Thanks. – alecxe May 03 '16 at 01:46
  • Hopefully I did it right. Just started as a user on this site. Can you answer these questions as well? – mg520 May 03 '16 at 01:58
  • @MattGrossman sure, I've used browser developer tools and inspected the requests going on while the search results were loaded. The `data` just replicates the request post parameters. Hope that helps. – alecxe May 03 '16 at 15:31
  • Don't quite understand this, but then again I'm a novice. Thanks so much for the help! I'll try to figure out the rest from here... – mg520 May 03 '16 at 18:13