Web scraping hidden texts not available in page source?

Question

I am developing a web scraper to obtain the full curriculum of a Udemy course, using Beautiful Soup and urllib.request in Python. However, the last sections of the curriculum on the page are collapsed, and you have to click to expand them. How can I extract the entire curriculum?

URL: https://www.udemy.com/python-the-complete-python-developer-course/

from urllib.request import Request, urlopen
from bs4 import BeautifulSoup as Soup

my_url = "https://www.udemy.com/python-the-complete-python-developer-course/"
head = {'User-Agent':'Mozilla/5.0'}
pagereq = Request(my_url, headers=head)

pager = urlopen(pagereq)

page = pager.read()
pager.close()
Sp = Soup(page, "html.parser")
Sections = Sp.findAll("div", {"class": "content-container"})
numlec = Sp.find("div", {"class": "num-lectures"})

for section in Sections:
    SecTitle = section.find("span", {"class": "lecture-title-text"}).text.strip()
    SecLen = section.find("span", {"class": "section-header-length"}).text.strip()
    lectures = section.findAll("div", {"class": "lecture-container"})
    print("-" * 40)
    print(SecTitle+"\t"+SecLen)
    print()
    for lecture in lectures:
        name = lecture.find("div", {"class": "title"}).text.strip()
        leng = lecture.find("span", {"class": "content-summary"}).text.strip()
        print("\t {}\t{}".format(name, leng))
    print("-" * 40)

This scrapes all the data up to the collapsed sections, but I want the full curriculum. Is there an easy way to do this?

You can use Selenium for Scrapping. Using that you can send Click Events to all collapsed bars. — planet260, Dec 05 '17 at 07:10
FYI it's "scraping", "to scrape" and "a scraper", like scratching. — data, Dec 05 '17 at 07:20

score 0 · Accepted Answer · answered Dec 05 '17 at 16:59

Try this. It first clicks the "7 more sections" button, then clicks each plus-sign button to unfold all the hidden items, and finally fetches every section title and its curriculum from that page.

from selenium import webdriver
from selenium.webdriver.common.by import By
import time

driver = webdriver.Chrome()
driver.get("https://www.udemy.com/python-the-complete-python-developer-course/")
time.sleep(2)  # give the page time to load

# Click the "7 more sections" button to reveal the remaining sections.
driver.find_element(By.CSS_SELECTOR, ".content-container.js-load-more").click()

# Click each section title to unfold its hidden lectures.
for link in driver.find_elements(By.CSS_SELECTOR, ".lecture-title-text"):
    link.click()
    time.sleep(2)

# Collect each section title together with the lecture titles under it.
for items in driver.find_elements(By.CSS_SELECTOR, ".content-container"):
    title = items.find_element(By.CSS_SELECTOR, ".lecture-title-text").text
    course_list = ' '.join(item.text for item in items.find_elements(By.CSS_SELECTOR, ".title"))
    print("Course_title: {}\nCurriculum: {}\n".format(title, course_list))

driver.quit()

Partial output:

Course_title: Introduction
Curriculum: 

Course_title: Python Setup for Windows
Curriculum: Introduction Install Python on Windows IDLE On Windows with a cool demo app! Downloading and Installing IntelliJ (FREE and PAID versions) on Windows Free 90 Day Extended Trial of IntelliJ Ultimate Edition Now Available Move to next section!

Course_title: Python Setup for Mac
Curriculum: Introduction Downloading And Installing Python On Mac OS X IDLE on Mac OS X with a cool demo app! Downloading and Installing IntelliJ (FREE and PAID version) for a Mac Free 90 Day Extended Trial of IntelliJ Ultimate Edition Now Available Move to next section!
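
The fixed time.sleep(2) pauses in the script above are a guess at how long the page needs. A common alternative is to poll for a condition until it holds or a timeout expires; Selenium ships WebDriverWait (in selenium.webdriver.support.ui) for this. The sketch below shows the same idea in plain Python so it runs without a browser. The helper name wait_until and its default values are assumptions made for illustration, not part of the answer or of Selenium.

```python
import time


def wait_until(condition, timeout=10.0, interval=0.5):
    """Call `condition` repeatedly until it returns a truthy value.

    Returns that value, or raises TimeoutError once `timeout` seconds
    have passed. `wait_until` is a hypothetical helper, not a Selenium API.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(
                "condition not met within {} seconds".format(timeout)
            )
        time.sleep(interval)


if __name__ == "__main__":
    # Stand-in for a page that becomes ready on the third check.
    checks = []

    def page_ready():
        checks.append(1)
        return len(checks) >= 3

    wait_until(page_ready, timeout=5.0, interval=0.1)
    print("ready after {} checks".format(len(checks)))
```

With a real driver, the condition could be a lambda that returns the list of matching elements, so the loop moves on as soon as they appear instead of always sleeping for two seconds.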

Thanks. Sorry for the late reply. I got something similar. — Arun Baby, Dec 11 '17 at 15:55
