
I am learning BeautifulSoup and trying to scrape the links of the questions present on this Quora page.

As I scroll down the page, more questions keep loading and being displayed.

When I try to scrape the links to these questions using the code below, I only get 5 links in my case, i.e. I only get the links of 5 questions even though there are a lot of questions on the page.

Is there any workaround to get the links of as many of the questions present on the page as possible?

from bs4 import BeautifulSoup
import requests

root = 'https://www.quora.com/topic/Graduate-Record-Examination-GRE-1'
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X x.y; rv:42.0) Gecko/20100101 Firefox/42.'}
r = requests.get(root, headers=headers)

soup = BeautifulSoup(r.text, 'html.parser')

q = soup.find('div', {'class': 'paged_list_wrapper'})
no = 0
for i in q.find_all('div', {'class': 'story_title_container'}):
    t = i.a['href']
    no = no + 1
    print(root + t, '\n\n')
Ajay Negi
  • Scraping Quora is against their terms of service. Read: https://www.quora.com/What-is-scraping-and-why-is-it-not-allowed-on-Quora – cs95 Dec 22 '18 at 14:29
  • @coldspeed Well, I need to make a dataset from this, and it is only for educational purposes. Kindly help! – Ajay Negi Dec 22 '18 at 14:34

2 Answers

1

What you are trying to accomplish cannot be done with Requests and BeautifulSoup alone. You need to use Selenium. Here I give an answer using Selenium and chromedriver. Download the chromedriver that matches your Chrome version and install Selenium with pip install -U selenium.

import time
import csv
from selenium import webdriver
from selenium.webdriver.common.keys import Keys

browser = webdriver.Chrome(executable_path='/path/to/chromedriver')
browser.get("https://www.quora.com/topic/Graduate-Record-Examination-GRE-1")
time.sleep(1)

# send PAGE_DOWN keystrokes to the body so Quora loads more questions
elem = browser.find_element_by_tag_name("body")
no_of_pagedowns = 5
while no_of_pagedowns:
    elem.send_keys(Keys.PAGE_DOWN)
    time.sleep(0.2)
    no_of_pagedowns -= 1

# every question link on the rendered page is an <a> with class 'question_link'
post_elems = browser.find_elements_by_xpath("//a[@class='question_link']")
for post in post_elems:
    print(post.get_attribute("href"))

If you are using Windows, use executable_path='/path/to/chromedriver.exe'.

Change the variable no_of_pagedowns = 5 to specify how many times you want to scroll down.

I got the following output (screenshot omitted; it shows the printed question URLs).

nandu kk
  • Aah!! Somebody recommended that I learn `BeautifulSoup` for web scraping. Well, it's really annoying to be introduced to a totally new framework. I still hope that this can be done using `BeautifulSoup`. Can't it be? – Ajay Negi Dec 22 '18 at 16:58
  • @rahul I don't think so. You can't directly use BeautifulSoup to execute events. Also, BeautifulSoup is the first choice for web scraping, but here you have to mimic scrolling. – nandu kk Dec 22 '18 at 17:18
  • You can still use BeautifulSoup with Selenium. Use Selenium to open the browser and render the page; then the content grabbed by Selenium can be passed into BeautifulSoup to parse. I've done it multiple times. – chitown88 Dec 22 '18 at 21:20
  • @chitown88 Yes, but you can't achieve the same using only the Requests + bs4 combination. – nandu kk Dec 22 '18 at 21:29
  • @nandukk Yes, you are correct. But he was wondering if he could still use BeautifulSoup, and he can; he's essentially just switching out requests with Selenium for those few lines. I do it all the time, as I like working with BeautifulSoup. There's also a package called 'requests-html' that allows you to render dynamic pages, but you're probably right that you'll need Selenium to scroll. I'm just not entirely familiar with 'requests-html'. – chitown88 Dec 22 '18 at 23:08
  • @nandukk I request you to kindly explain your piece of code briefly for better understanding. Also, I want to close the browser at the end on macOS. – Ajay Negi Dec 23 '18 at 09:56
  • @Nandukk Is it possible for me to know how many questions have been asked related to this topic? For that, I can't scroll a fixed `no_of_pagedowns` number of times; it needs to be much larger, until the page ends. How shall I achieve this? – Ajay Negi Dec 23 '18 at 10:04
  • @rahul https://stackoverflow.com/questions/42982950/how-to-scroll-down-the-page-till-bottomend-page-in-the-selenium-webdriver – nandu kk Dec 23 '18 at 10:08
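
Building on the comments above, here is a minimal sketch (not part of either answer) of how the two ideas can be combined: Selenium scrolls until the page height stops growing, and the rendered HTML is then handed to BeautifulSoup. The question_link class and the chromedriver path are carried over from the answer above and may not match Quora's current markup.

import time
from bs4 import BeautifulSoup
from selenium import webdriver

browser = webdriver.Chrome(executable_path='/path/to/chromedriver')
browser.get("https://www.quora.com/topic/Graduate-Record-Examination-GRE-1")

# keep scrolling until the page height stops growing, i.e. no more questions load
# (the pattern from the Stack Overflow answer linked in the comments)
last_height = browser.execute_script("return document.body.scrollHeight")
while True:
    browser.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    time.sleep(2)  # give Quora time to load the next batch of questions
    new_height = browser.execute_script("return document.body.scrollHeight")
    if new_height == last_height:
        break
    last_height = new_height

# hand the fully rendered HTML to BeautifulSoup, as suggested in the comments
soup = BeautifulSoup(browser.page_source, 'html.parser')
links = [a['href'] for a in soup.find_all('a', {'class': 'question_link'})]

print(len(links), 'question links found')
for link in links:
    # hrefs may be relative; prefix with https://www.quora.com if needed
    print(link)

browser.quit()  # close the browser when finished

With this loop there is no need to guess no_of_pagedowns, len(links) gives a rough count of the questions loaded, and browser.quit() closes the browser when you are done.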
0

The question title is grabbed from the page and printed after formatting. This is one way to do it; I'm sure there are many other ways, and this only handles one question.

import requests
from bs4 import BeautifulSoup

URL = "https://www.quora.com/Which-Deep-Learning-online-course-is-better-Coursera-specialization-VS-Udacity-Nanodegree-vs-FAST-ai"

response = requests.get(URL)
soup = BeautifulSoup(response.text, 'html.parser')

# grab the text inside the <title> tag
question = soup.select_one('title').text
# slice off the trailing " - Quora" (8 characters)
x = slice(-8)

print(question[x])
Daniel
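
As a small hedged variant of the slice(-8) above, assuming Quora page titles end with the literal suffix " - Quora", the suffix can be stripped only when it is actually present:

suffix = ' - Quora'  # assumed title suffix; verify against the actual page title
if question.endswith(suffix):
    question = question[:-len(suffix)]

print(question)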