I'm trying to scrape course names and links from edx.org. Currently my code looks like this:

from bs4 import BeautifulSoup
import requests

baseUrl = 'https://www.edx.org/course?search_query='
userString = 'python'
scrapeUrl = baseUrl + userString

result = requests.get(scrapeUrl)
print('Response Status:', result.status_code)
src = result.content
# Name the parser explicitly so BeautifulSoup does not have to guess one
soup = BeautifulSoup(src, 'html.parser')

# Save the parsed response for inspection; the with-block closes the file
with open('response.html', 'w', encoding='utf-8') as f:
    f.write(str(soup))

print(result.headers)
print(soup.find('a', {"class": "course-link"}))

The problem is that the last line of my output, from the soup.find('a', {'class': 'course-link'}) call, is None. From what I can tell, the page that gets returned is just a raw 'template' of sorts, and the actual course objects haven't been loaded into the page yet. (I tried a hard refresh using cmd+shift+r, and the courses took some time to actually appear on the page.)

I've honestly not done much research on this yet. Perhaps the solution lies in the requests module and not the beautiful soup module.

I think I'll have to look for a way to wait until all the courses have loaded.

I'd appreciate any help.

1 Answer

A webpage isn't just a single document (anymore); it is often built up from many requests. The result of requests.get() is a single piece of data, typically the initial HTML page, which a browser would read, interpret, and then use to request more data.

The Python requests module doesn't do that: it doesn't read and interpret the page. It just gets what you requested.

So, you need either something that works like a browser: it gets the first page, loads additional resources, and interprets the JavaScript (which may cause more resources to load). Selenium is a great tool for that.
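A rough sketch of the browser-driven route, assuming Selenium 4 with Chrome and a matching driver installed. The a.course-link selector is taken from the question's soup.find call and may not match the site's current markup; the function names and the 15-second timeout are made up for illustration.

```python
from typing import List, Tuple

# Assumption: the class name comes from the question, not from inspecting the live site.
COURSE_LINK_SELECTOR = "a.course-link"


def collect_links(driver, selector: str = COURSE_LINK_SELECTOR) -> List[Tuple[str, str]]:
    """Return (text, href) pairs for every element matching a CSS selector."""
    # "css selector" is the locator string that selenium's By.CSS_SELECTOR stands for.
    elements = driver.find_elements("css selector", selector)
    return [(el.text.strip(), el.get_attribute("href")) for el in elements]


def scrape_rendered(
    url: str, selector: str = COURSE_LINK_SELECTOR, timeout: float = 15.0
) -> List[Tuple[str, str]]:
    """Load url in a real browser, wait for the links to appear, then collect them."""
    # Imported here so collect_links can be tried without Selenium installed.
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    driver = webdriver.Chrome()  # assumes Chrome and its driver are available
    try:
        driver.get(url)
        # Block until the JavaScript has added at least one matching element,
        # or raise TimeoutException after `timeout` seconds.
        WebDriverWait(driver, timeout).until(
            EC.presence_of_element_located((By.CSS_SELECTOR, selector))
        )
        return collect_links(driver, selector)
    finally:
        driver.quit()  # always close the browser, even on timeout
```

Calling scrape_rendered('https://www.edx.org/course?search_query=python') would then stand in for the requests.get plus soup.find pair from the question. The explicit wait is the part that addresses "wait until all the courses have loaded".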

Or, you need to look at the page to see what's being loaded & perhaps make that request instead.

For example, look at the www.edx.org page using a browser debugger and you'll see that it (the home page) loads a file called subjects (actually https://www.edx.org/api/v1/catalog/subjects).

And, if you look at that file, you'll see it's JSON:

{"count": 31,
 "next": null,
 "previous": null,
 "results": [
    {
        "name": "Computer Science",
        "subtitle": "<p>Take online computer science courses from top institutions including Harvard, MIT and Microsoft. Learn to code with computer science courses including programming, web design, and more.</p>",
        "description": "<p>Enroll in the latest computer science courses covering Python, C programming, R, Java, artificial intelligence, cybersecurity, software engineering, and more. Learn from Harvard, MIT, Microsoft, IBM, and other top institutions. Join today.</p>\n<p>Related Topics - <a href=\"/learn/computer-programming\">Programming</a> | <a href=\"/learn/android-development\">Android Development</a> | <a href=\"/learn/apache-spark\">Apache Spark</a> | <a href=\"/learn/app-development\">App Development</a> | <a href=\"/learn/artificial-intelligence\">Artificial Intelligence</a> | <a href=\"/learn/azure\">Azure</a> | <a href=\"https://www.edx.org/learn/big-data\">Big Data</a> | <a href=\"/learn/blockchain-cryptography\">Blockchain</a> | <a href=\"https://www.edx.org/learn/c-programming\">C</a> | <a href=\"https://www.edx.org/learn/c-plus-plus\">C++</a> | <a href=\"https://www.edx.org/learn/c-sharp\">C#</a> | <a href=\"/learn/cloud-computing\">Cloud Computing</a> | <a href=\"/learn/cybersecurity\">Cybersecurity</a> | <a href=\"https://www.edx.org/learn/data-science\">Data Science</a> | <a href=\"https://www.edx.org/learn/data-analysis\">Data Analysis</a> | <a href=\"/learn/databases\">Databases</a> | <a href=\"https://www.edx.org/learn/devops\">Devops</a> | <a href=\"/learn/front-end-web-development\">Front End Web Development</a> | <a href=\"/learn/hadoop\">Hadoop</a> | <a href=\"/learn/html\">HTML</a> | <a href=\"/learn/information-technology\">Information Technology</a> | <a href=\"/learn/java\">Java</a> | <a href=\"/learn/javascript\">JavaScript</a> | <a href=\"/learn/linux\">Linux</a> | <a href=\"/learn/machine-learning\">Machine Learning</a> | <a href=\"/learn/matlab\">Matlab</a> | <a href=\"/learn/mobile-development\">Mobile Development</a> | <a href=\"/learn/python\">Python</a> | <a href=\"https://www.edx.org/learn/r-programming\">R</a> | <a href=\"/learn/robotics\">Robotics</a> | <a href=\"https://www.edx.org/learn/software-engineering\">Software 
Engineering</a> | <a href=\"https://www.edx.org/learn/sql\">SQL</a> | <a href=\"/learn/t-sql\">T-SQL</a> | <a href=\"https://www.edx.org/learn/user-experience-ux\">UX Design</a> | <a href=\"https://www.edx.org/learn/virtual-reality\">Virtual Reality</a> | <a href=\"/learn/web-development\">Web Development</a> | <a href=\"https://www.edx.org/learn/web-design\">Web Design</a> | <a href=\"https://www.edx.org/masters/online-master-science-computer-science-utaustinx\">Master's in Computer Science</a> | <a href=\"https://www.edx.org/masters/online-master-science-analytics-georgia-tech\">Master's in Analytics</a> | <a href=\"https://www.edx.org/masters/online-master-data-science-uc-san-diego\">Master's in Data Science</a></p>",
        "banner_image_url": "https://www.edx.org/sites/default/files/cs-1440x210.jpg",
        "card_image_url": "https://www.edx.org/sites/default/files/subject/image/card/computer-science.jpg",
        "slug": "computer-science",
        "uuid": "e52e2134-a4e4-4fcb-805f-cbef40812580"
    },
... etc.

So, depending on what you want to do, you might just use requests.get('https://www.edx.org/api/v1/catalog/subjects'), convert the result from JSON to a Python object, and problem solved!
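As a minimal sketch of that last step, assuming the endpoint still answers with the shape shown above (a top-level "results" list whose items carry "name" and "slug"). The helper names and the 10-second timeout are illustrative, not part of any edX API.

```python
from typing import Any, Dict

SUBJECTS_URL = "https://www.edx.org/api/v1/catalog/subjects"


def fetch_subjects(url: str = SUBJECTS_URL) -> Dict[str, Any]:
    """Request the JSON document and decode it into Python dicts and lists."""
    # Imported here so subject_slugs can be tried without requests installed.
    import requests

    response = requests.get(url, timeout=10)
    response.raise_for_status()  # fail loudly on 4xx/5xx instead of parsing an error page
    return response.json()       # same as json.loads(response.text)


def subject_slugs(payload: Dict[str, Any]) -> Dict[str, str]:
    """Map each subject name to its slug, e.g. 'Computer Science' -> 'computer-science'."""
    return {item["name"]: item["slug"] for item in payload.get("results", [])}
```

With those in place, subject_slugs(fetch_subjects()) gives a plain dict you can loop over, with no HTML parsing involved.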

pbuck
  • Wanted to confirm that the `requests` module is the same as `urllib.request`. – Rahul Tandon Dec 22 '19 at 08:28
  • @RahulTandon, they are not the same: the requests module has more features than urllib.request. You should get familiar with it and use it; it's really good. (Depending on your installation, you may have to install the requests module, as it does not come with base Python.) – pbuck Dec 22 '19 at 20:42