
My goal is to collect as many profile links as possible on Khan Academy, and then select some specific data on each of these profiles to store in a CSV file.

Here is my script to get the profile links, scrape specific data from each of these profiles, and store them in a CSV file:

from bs4 import BeautifulSoup
from requests_html import HTMLSession
import re

session = HTMLSession()
r = session.get('https://www.khanacademy.org/computing/computer-science/algorithms/intro-to-algorithms/v/what-are-algorithms')
r.html.render(sleep=5)
soup=BeautifulSoup(r.html.html,'html.parser')

#find the profile links
profiles = soup.find_all(href=re.compile("/profile/kaid"))
profile_list=[]
for links in profiles:
    links_no_list = links.extract()
    text_link = links_no_list['href']
    text_link_nodiscussion = text_link[:-10]
    final_profile_link ='https://www.khanacademy.org'+text_link_nodiscussion
    profile_list.append(final_profile_link)

#create the csv file
filename = "khanscraptry1.csv"
f = open(filename, "w")
headers = "link, date_joined, points, videos, questions, votes, answers, flags, project_request, project_replies, comments, tips_thx, last_date\n"
f.write(headers)

#for each profile link, scrape the specific data and store them into the csv
for link in profile_list: 
    print("Scrapping ",link)
    session = HTMLSession()
    r = session.get(link)
    r.html.render(sleep=5)
    soup=BeautifulSoup(r.html.html,'html.parser')
    user_info_table=soup.find('table', class_='user-statistics-table')
    if user_info_table is not None:
        dates,points,videos=[tr.find_all('td')[1].text for tr in user_info_table.find_all('tr')]
    else:
        dates=points=videos='NA'

    user_socio_table=soup.find_all('div', class_='discussion-stat')
    data = {}
    for gettext in user_socio_table:
        category = gettext.find('span')
        category_text = category.text.strip()
        number = category.previousSibling.strip()
        data[category_text] = number

    full_data_keys=['questions','votes','answers','flags raised','project help requests','project help replies','comments','tips and thanks'] #might change 'answers' to 'answer': when the count is 1 the label is singular, so 'answers' is missing and NA is stored instead
    for header_value in full_data_keys:
        if header_value not in data.keys():
            data[header_value]='NA'

    user_calendar = soup.find('div',class_='streak-calendar-scroll-container')
    if user_calendar is not None:
        last_activity = user_calendar.find('span',class_='streak-cell filled')
        try:
            last_activity_date = last_activity['title']
        except TypeError:
            last_activity_date='NA'
    else:
        last_activity_date='NA'
    f.write(link + "," + dates + "," + points.replace("," , "") + "," + videos + "," + data['questions'] + "," + data['votes'] + "," + data['answers'] + "," + data['flags raised'] + "," + data['project help requests'] + "," + data['project help replies'] + "," + data['comments'] + "," + data['tips and thanks'] + "," + last_activity_date + "\n")

f.close()

This first script should work fine. Now, my problem is that this script only found about 40 profile links: print(len(profile_list)) returns 40.

If I could click on the Show more button (on https://www.khanacademy.org/computing/computer-science/algorithms/intro-to-algorithms/v/what-are-algorithms), then I would get more profile links (and thus more profiles to scrape).

This second script keeps clicking the Show more button until there is no Show more button left:

from selenium import webdriver
from selenium.common.exceptions import NoSuchElementException

driver = webdriver.Chrome() #watch out, change if you are not using Chrome
driver.get("https://www.khanacademy.org/computing/computer-science/algorithms/intro-to-algorithms/v/what-are-algorithms")
driver.implicitly_wait(10)

def showmore(driver):
    #keep clicking the Show more button until it no longer exists
    while True:
        try:
            driver.implicitly_wait(5)
            showmore = driver.find_element_by_class_name("button_1eqj1ga-o_O-shared_1t8r4tr-o_O-default_9fm203")
            showmore.click()
        except NoSuchElementException:
            break

showmore(driver)

This second script should also work fine.

My question is: how can I merge these two scripts? How can I make BeautifulSoup, Selenium, and Requests work together?

In other words: how can I use the second script to get the fully loaded page, and then feed that page into the first script?

RobZ
  • It seems that the same profile is being scraped two times, is this required? – Bitto Mar 02 '19 at 18:42
  • No, it is not required (actually I would like to have unique links). I could do `for link in profile_list[::2]:` instead of `for link in profile_list:` like you said in your comment here: `https://stackoverflow.com/a/54892333/10972294`. But your solution below is much more convenient. – RobZ Mar 02 '19 at 19:58

1 Answer


My question is: how can I merge these two scripts? How to make BeautifulSoup, Selenium and Requests work together?

You don't need to. Selenium alone can perform all of the required actions and also get the required data. Another alternative is to use Selenium for the actions (such as clicking), get the page_source, and let BeautifulSoup do the parsing. I have used the second option. Please note that this is because I am more comfortable with BeautifulSoup, not because Selenium can't get the required data.
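
For illustration, the hand-off between the two libraries is just this (a minimal sketch; the URL is a placeholder):

from selenium import webdriver
from bs4 import BeautifulSoup

driver = webdriver.Chrome()
driver.get("https://example.com") #placeholder URL; any dynamically loaded page works
#Selenium renders the page; BeautifulSoup parses the rendered HTML
soup = BeautifulSoup(driver.page_source, 'html.parser')
print(soup.title)
driver.quit()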

Merged Script

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import TimeoutException,StaleElementReferenceException
from bs4 import BeautifulSoup
import re
driver = webdriver.Chrome() #watch out, change if you are not using Chrome
driver.get("https://www.khanacademy.org/computing/computer-science/algorithms/intro-to-algorithms/v/what-are-algorithms")
while True:
    try:
        showmore=WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.XPATH ,'//*[@id="v/what-are-algorithms-panel"]/div[1]/div/div[6]/div/div[4]/button')))
        showmore.click()
    except TimeoutException:
        break
    except StaleElementReferenceException:
        break

soup=BeautifulSoup(driver.page_source,'html.parser')
#find the profile links
profiles = soup.find_all(href=re.compile("/profile/kaid"))
profile_list=[]
for links in profiles:
    links_no_list = links.extract()
    text_link = links_no_list['href']
    text_link_nodiscussion = text_link[:-10]
    final_profile_link ='https://www.khanacademy.org'+text_link_nodiscussion
    profile_list.append(final_profile_link)

#remove duplicates
#remove the line below if you want the duplicates
profile_list=list(set(profile_list))
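#note: set() does not preserve the original order; an order-preserving
#alternative (assuming Python 3.7+) would be:
#profile_list=list(dict.fromkeys(profile_list))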

#print number of profiles we got
print(len(profile_list))
#create the csv file
filename = "khanscraptry1.csv"
f = open(filename, "w")
headers = "link, date_joined, points, videos, questions, votes, answers, flags, project_request, project_replies, comments, tips_thx, last_date\n"
f.write(headers)


#for each profile link, scrape the specific data and store them into the csv
for link in profile_list:
    #print each profile link we are about to scrape
    print("Scrapping ",link)
    driver.get(link)
    #wait for content to load
    #if profile does not exist skip
    try:
        WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.XPATH ,'//*[@id="widget-list"]/div[1]/div[1]')))
    except TimeoutException:
        continue
    soup=BeautifulSoup(driver.page_source,'html.parser')
    user_info_table=soup.find('table', class_='user-statistics-table')
    if user_info_table is not None:
        dates,points,videos=[tr.find_all('td')[1].text for tr in user_info_table.find_all('tr')]
    else:
        dates=points=videos='NA'

    user_socio_table=soup.find_all('div', class_='discussion-stat')
    data = {}
    for gettext in user_socio_table:
        category = gettext.find('span')
        category_text = category.text.strip()
        number = category.previousSibling.strip()
        data[category_text] = number

    full_data_keys=['questions','votes','answers','flags raised','project help requests','project help replies','comments','tips and thanks'] #might change 'answers' to 'answer': when the count is 1 the label is singular, so 'answers' is missing and NA is stored instead
    for header_value in full_data_keys:
        if header_value not in data.keys():
            data[header_value]='NA'

    user_calendar = soup.find('div',class_='streak-calendar-scroll-container')
    if user_calendar is not None:
        last_activity = user_calendar.find('span',class_='streak-cell filled')
        try:
            last_activity_date = last_activity['title']
        except TypeError:
            last_activity_date='NA'
    else:
        last_activity_date='NA'
    f.write(link + "," + dates + "," + points.replace("," , "") + "," + videos + "," + data['questions'] + "," + data['votes'] + "," + data['answers'] + "," + data['flags raised'] + "," + data['project help requests'] + "," + data['project help replies'] + "," + data['comments'] + "," + data['tips and thanks'] + "," + last_activity_date + "\n")

f.close()
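
As a side note, Python's built-in csv module handles quoting automatically, so a field that contains commas (like the points value) would not need the .replace("," , "") workaround. A minimal sketch with placeholder values:

import csv

with open("khanscraptry1.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["link", "date_joined", "points", "videos"])
    #csv.writer quotes the points field because it contains commas
    writer.writerow(["https://www.khanacademy.org/profile/kaid_x/", "5 years ago", "2,152,299", "513"])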

Sample Console Output

551
Scraping  https://www.khanacademy.org/profile/kaid_888977072825430260337359/
Scraping  https://www.khanacademy.org/profile/kaid_883316191998827325047066/
Scraping  https://www.khanacademy.org/profile/kaid_1174374133389372329315932/
Scraping  https://www.khanacademy.org/profile/kaid_175131632601098270919916/
Scraping  https://www.khanacademy.org/profile/kaid_120532771190025953629523/
Scraping  https://www.khanacademy.org/profile/kaid_443636490088836886070300/
Scraping  https://www.khanacademy.org/profile/kaid_1202505937095267213741452/
Scraping  https://www.khanacademy.org/profile/kaid_464949975690601300556189/
Scraping  https://www.khanacademy.org/profile/kaid_727801603402106934190616/
Scraping  https://www.khanacademy.org/profile/kaid_808370995413780397188230/
Scraping  https://www.khanacademy.org/profile/kaid_427134832219441477944618/
Scraping  https://www.khanacademy.org/profile/kaid_232193725763932936324703/
Scraping  https://www.khanacademy.org/profile/kaid_167043118118112381390423/
Scraping  https://www.khanacademy.org/profile/kaid_17327330351684516133566/
...

Sample File Output (khanscraptry1.csv)

link, date_joined, points, videos, questions, votes, answers, flags, project_request, project_replies, comments, tips_thx, last_date
https://www.khanacademy.org/profile/kaid_888977072825430260337359/,NA,NA,NA,NA,0,0,0,NA,NA,0,0,Tuesday Dec 8 2015
https://www.khanacademy.org/profile/kaid_883316191998827325047066/,5 years ago,2152299,513,10,884,34,16,82,108,1290,360,Monday Aug 27 2018
https://www.khanacademy.org/profile/kaid_1174374133389372329315932/,NA,NA,NA,2,0,0,0,NA,NA,0,0,NA
https://www.khanacademy.org/profile/kaid_175131632601098270919916/,NA,NA,NA,173,19,2,0,NA,NA,128,3,Thursday Feb 7 2019
https://www.khanacademy.org/profile/kaid_120532771190025953629523/,NA,NA,NA,9,0,3,18,NA,NA,4,4,Tuesday Oct 11 2016
https://www.khanacademy.org/profile/kaid_443636490088836886070300/,7 years ago,3306784,987,10,231,49,11,8,156,10,NA,Sunday Jul 22 2018
https://www.khanacademy.org/profile/kaid_1202505937095267213741452/,NA,NA,NA,2,0,0,0,NA,NA,0,0,Thursday Apr 28 2016
https://www.khanacademy.org/profile/kaid_464949975690601300556189/,NA,NA,NA,NA,0,0,0,NA,NA,0,0,Friday Mar 16 2018
https://www.khanacademy.org/profile/kaid_727801603402106934190616/,5 years ago,2927634,1049,6,562,332,9,NA,NA,20,NA,NA
https://www.khanacademy.org/profile/kaid_808370995413780397188230/,NA,NA,NA,NA,19,192,0,NA,NA,52,NA,Saturday Jan 19 2019
https://www.khanacademy.org/profile/kaid_427134832219441477944618/,NA,NA,NA,2,0,0,0,NA,NA,0,0,Tuesday Sep 18 2018
https://www.khanacademy.org/profile/kaid_232193725763932936324703/,NA,NA,NA,NA,0,0,0,NA,NA,0,0,Monday May 15 2017
https://www.khanacademy.org/profile/kaid_167043118118112381390423/,NA,NA,NA,NA,0,0,0,NA,NA,0,0,Friday Mar 1 2019
https://www.khanacademy.org/profile/kaid_17327330351684516133566/,NA,NA,NA,NA,0,0,0,NA,NA,0,0,NA
https://www.khanacademy.org/profile/kaid_146705727466233630898864/,NA,NA,NA,NA,0,0,0,NA,NA,0,0,Thursday Apr 5 2018
Bitto
  • The script is working fine, thanks! Now, though, we have been working on this link: `https://www.khanacademy.org/computing/computer-science/algorithms/intro-to-algorithms/v/what-are-algorithms`. Can we apply the merged script to multiple `driver.get()` links? For example: I take this link `https://www.khanacademy.org/computing/computer-programming/programming#intro-to-programming`, then create a list of every course content link with `class='link_1uvuyao-o_O-nodeStyle_cu2reh-o_O-nodeStyleIcon_4udnki'`. Can `driver.get()` take the output of this list (and then execute the merged script above)? – RobZ Mar 03 '19 at 15:04
  • I'm not sure if I was clear enough with my previous comment. Let me know if I should create a new post for it. (I do have some code, but it's not really working.) – RobZ Mar 03 '19 at 15:08
  • @RobZ You could store each link and its class in a dictionary and then loop through it (see the sketch after these comments). It is always better to ask a new question; other users may be able to find better solutions to the problem which you or I may not find. – Bitto Mar 03 '19 at 16:46
  • I think I've coded the dictionary and the loop right: `https://stackoverflow.com/q/54975740/10972294`? But I can't be sure; the script is taking way too much time. – RobZ Mar 04 '19 at 12:52
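
A minimal sketch of the dictionary-and-loop idea from the comments above (the URL and class name come from the discussion; the rest is hypothetical and untested against the live site):

from selenium import webdriver

#hypothetical mapping from course page URL to the CSS class of its content links
pages = {
    "https://www.khanacademy.org/computing/computer-programming/programming#intro-to-programming":
        "link_1uvuyao-o_O-nodeStyle_cu2reh-o_O-nodeStyleIcon_4udnki",
}

driver = webdriver.Chrome()
for url, link_class in pages.items():
    driver.get(url)
    #collect the href of every content link on this course page
    course_links = [el.get_attribute("href") for el in driver.find_elements_by_class_name(link_class)]
    #each course link could then be fed through the merged script above
    print(course_links)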