I am writing a web scraper to extract the full curriculum of a Udemy course, using Beautiful Soup and urllib.request in Python. However, the last sections of the curriculum on the page are collapsed, and you have to click to expand them. How can I extract the entire curriculum?

URL: https://www.udemy.com/python-the-complete-python-developer-course/

from urllib.request import Request, urlopen
from bs4 import BeautifulSoup as Soup

my_url = "https://www.udemy.com/python-the-complete-python-developer-course/"
head = {'User-Agent': 'Mozilla/5.0'}
pagereq = Request(my_url, headers=head)

# Download the static HTML of the course page.
pager = urlopen(pagereq)
page = pager.read()
pager.close()

Sp = Soup(page, "html.parser")
Sections = Sp.find_all("div", {"class": "content-container"})
# Lecture-count element (not used below).
numlec = Sp.find("div", {"class": "num-lectures"})

# Print each section header, then the lectures it contains.
for section in Sections:
    SecTitle = section.find("span", {"class": "lecture-title-text"}).text.strip()
    SecLen = section.find("span", {"class": "section-header-length"}).text.strip()
    lectures = section.find_all("div", {"class": "lecture-container"})
    print("-" * 40)
    print(SecTitle + "\t" + SecLen)
    print()
    for lecture in lectures:
        name = lecture.find("div", {"class": "title"}).text.strip()
        leng = lecture.find("span", {"class": "content-summary"}).text.strip()
        print("\t {}\t{}".format(name, leng))
    print("-" * 40)

This scrapes everything up to the collapsed sections, but I want the full curriculum. Is there an easy way to do this?

Arun Baby

1 Answer

Try this. It first clicks the "7 more sections" button, then clicks each plus-sign button to unfold all the hidden items, and finally fetches every section title and the curriculum listed under it on that page.

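from selenium import webdriver
from selenium.webdriver.common.by import By
import time

driver = webdriver.Chrome()
driver.get("https://www.udemy.com/python-the-complete-python-developer-course/")
time.sleep(2)  # crude wait for the page to load

# Click the "more sections" button to reveal the rest of the curriculum.
# find_element(By.CSS_SELECTOR, ...) is used because the older
# find_element_by_css_selector helpers were deprecated in Selenium 4 and later removed.
driver.find_element(By.CSS_SELECTOR, ".content-container.js-load-more").click()

# Click every section title to unfold its lectures.
for link in driver.find_elements(By.CSS_SELECTOR, '.lecture-title-text'):
    link.click()
    time.sleep(2)

# Print each section title followed by the lecture titles inside it.
for items in driver.find_elements(By.CSS_SELECTOR, ".content-container"):
    title = items.find_element(By.CSS_SELECTOR, ".lecture-title-text").text
    course_list = ' '.join([item.text for item in items.find_elements(By.CSS_SELECTOR, ".title")])
    print("Course_title: {}\nCurriculum: {}\n".format(title, course_list))

driver.quit()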

Partial output:

Course_title: Introduction
Curriculum: 

Course_title: Python Setup for Windows
Curriculum: Introduction Install Python on Windows IDLE On Windows with a cool demo app! Downloading and Installing IntelliJ (FREE and PAID versions) on Windows Free 90 Day Extended Trial of IntelliJ Ultimate Edition Now Available Move to next section!

Course_title: Python Setup for Mac
Curriculum: Introduction Downloading And Installing Python On Mac OS X IDLE on Mac OS X with a cool demo app! Downloading and Installing IntelliJ (FREE and PAID version) for a Mac Free 90 Day Extended Trial of IntelliJ Ultimate Edition Now Available Move to next section!
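
Because the lecture titles are joined with spaces, the output above does not show where one title ends and the next begins. One way around that is to keep the titles as a list and print one per line. The following is only a sketch: `format_curriculum` is a hypothetical helper, not part of Selenium or Beautiful Soup, and the split of the sample titles is a guess, since the joined output does not reveal the real boundaries.

```python
def format_curriculum(sections):
    """Render (section title, list of lecture titles) pairs as indented text.

    Hypothetical helper for illustration only.
    """
    lines = []
    for title, lectures in sections:
        lines.append("Course_title: {}".format(title))
        if not lectures:
            # Makes an empty section visible instead of printing nothing.
            lines.append("    (no lectures found)")
        for number, lecture in enumerate(lectures, start=1):
            lines.append("    {}. {}".format(number, lecture))
        lines.append("")  # blank line between sections
    return "\n".join(lines).rstrip()


# Sample data loosely based on the partial output above; the boundaries
# between lecture titles are assumed, not taken from the page.
sample = [
    ("Introduction", []),
    ("Python Setup for Windows", ["Introduction", "Install Python on Windows"]),
]
print(format_curriculum(sample))
```

To use it with the script above, you would collect `(title, [item.text for item in ...])` tuples in the last loop instead of joining the titles, and pass the resulting list to the helper once the loop finishes.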
SIM