I'm learning web scraping with BeautifulSoup and am using Stack Overflow's interesting questions section ("https://stackoverflow.com/?tab=interesting") for practice.

I want to extract the hyperlinks for the top 5 questions that are tagged with 'java' AND have at least one answer (it's fine if an answer has been accepted, but that isn't a requirement).

I've looked at the BeautifulSoup documentation, but I can't work out how to combine the three searches below into one result.

Thanks for any help!

CODE:

from bs4 import BeautifulSoup
from urllib.request import urlopen

html = urlopen("https://stackoverflow.com/?tab=interesting")
content = html.read()
soup = BeautifulSoup(content)

soup.findAll('a',{'class':'question-hyperlink'}, href = True ,  limit=5)        # question link 
soup.findAll('div', {'class':'status answered'},  limit=5)                      # question answer 
soup.findAll('a',{'class':'post-tag'}, rel ='tag' , text = 'java',  limit=5)    # question user tag

DESIRED OUTPUT (as hyperlinks):

https://stackoverflow.com/questions/number/first-question-to-meet-the-criteria
https://stackoverflow.com/questions/number/second-question-to-meet-the-criteria
https://stackoverflow.com/questions/number/third-question-to-meet-the-criteria
https://stackoverflow.com/questions/number/fourth-question-to-meet-the-criteria
https://stackoverflow.com/questions/number/fifth-question-to-meet-the-criteria 
Captain Jack Sparrow
Robbie

1 Answer

Try this:

from bs4 import BeautifulSoup
import requests

html = requests.get("https://stackoverflow.com/?tab=interesting")
# name the parser explicitly so bs4 does not have to guess one (and warn)
soup = BeautifulSoup(html.content, "html.parser")

# find and iterate over all parent divs of questions
for elem in soup.find_all("div", {"class": "question-summary narrow"}):
    # get count of answers
    answer = elem.find("div", {"class": "mini-counts"})
    # strip whitespace so a count of " 0 " still compares equal to "0"
    if answer is not None and answer.text.strip() != "0":
        # check if question is tagged with "java"
        tags = elem.find("div", {"class": "t-java"})
        if tags is not None:
            # print link (the href is relative to https://stackoverflow.com)
            print(elem.find("a")["href"])

If nothing is printed, no question currently on the page matches both conditions. To check that the code itself works, try a more common tag class, for example t-python.
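The question also asks for at most five results and for full URLs, which the loop above does not cover. Here is a minimal sketch of one way to add that, run against a small inline sample instead of the live page so it stays reproducible. The sample markup, `BASE_URL`, and the helpers `answer_count` and `answered_links` are all assumptions made for illustration; only the class names are taken from this thread, and the real page may be structured differently. The sketch reads the counter inside the `status` div because, in the assumed markup, the vote count also sits in a `mini-counts` div.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

BASE_URL = "https://stackoverflow.com"  # assumed base for the relative hrefs

# Hypothetical markup: it only mirrors the class names quoted in this thread.
SAMPLE_HTML = """
<div class="question-summary narrow">
  <div class="votes"><div class="mini-counts"><span>0</span></div></div>
  <div class="status answered"><div class="mini-counts"><span>2</span></div></div>
  <h3><a class="question-hyperlink" href="/questions/1/java-answered">Q1</a></h3>
  <div class="tags t-java"><a class="post-tag" rel="tag">java</a></div>
</div>
<div class="question-summary narrow">
  <div class="votes"><div class="mini-counts"><span>5</span></div></div>
  <div class="status unanswered"><div class="mini-counts"><span>0</span></div></div>
  <h3><a class="question-hyperlink" href="/questions/2/java-unanswered">Q2</a></h3>
  <div class="tags t-java"><a class="post-tag" rel="tag">java</a></div>
</div>
<div class="question-summary narrow">
  <div class="votes"><div class="mini-counts"><span>1</span></div></div>
  <div class="status answered"><div class="mini-counts"><span>1</span></div></div>
  <h3><a class="question-hyperlink" href="/questions/3/python-answered">Q3</a></h3>
  <div class="tags t-python"><a class="post-tag" rel="tag">python</a></div>
</div>
"""


def answer_count(summary):
    """Return the number shown in the summary's status box, or 0 if unreadable."""
    status = summary.find("div", class_="status")
    counter = status.find("div", class_="mini-counts") if status is not None else None
    if counter is None:
        return 0
    try:
        return int(counter.get_text(strip=True))
    except ValueError:
        return 0


def answered_links(html, tag="java", limit=5):
    """Collect up to `limit` absolute links to answered questions carrying `tag`."""
    soup = BeautifulSoup(html, "html.parser")
    links = []
    for summary in soup.find_all("div", class_="question-summary"):
        # tag names come from the text of the post-tag anchors
        tags = {a.get_text(strip=True) for a in summary.find_all("a", class_="post-tag")}
        link = summary.find("a", class_="question-hyperlink", href=True)
        if tag in tags and link is not None and answer_count(summary) > 0:
            links.append(urljoin(BASE_URL, link["href"]))
            if len(links) == limit:
                break
    return links


for url in answered_links(SAMPLE_HTML):
    print(url)
```

To try it on the live page, pass `requests.get(...).content` instead of `SAMPLE_HTML`, keeping in mind that the class names may need adjusting to whatever the site serves at the time.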

petezurich
  • I believe it's recommended to use `.content`, not `.text`, to feed the result of the request to BeautifulSoup. – AMC Feb 08 '20 at 16:40
  • Why do you think so? `.content` returns bytes rather than text and therefore throws an error. `.text` yields the raw HTML as text and works fine. – petezurich Feb 08 '20 at 16:42
  • There's some information on the subject [here](https://stackoverflow.com/q/36833357/11301900) and [here](https://stackoverflow.com/questions/40163323/should-i-use-text-or-content-when-parsing-a-requests-response/40163370), for example. – AMC Feb 08 '20 at 16:48
  • Also, testing for `None` should probably be done using `is`, instead of `==`. – AMC Feb 08 '20 at 16:48
  • I stand corrected. Didn't know that. Thanks for the hint! – petezurich Feb 08 '20 at 16:50
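To illustrate the two points from the comments, here is a small sketch that feeds BeautifulSoup both bytes and text and then tests a failed lookup with `is None`. The page body is a made-up stand-in; with a real `requests` response the two inputs would be `response.content` and `response.text`.

```python
from bs4 import BeautifulSoup

# Hypothetical page body standing in for a downloaded response.
html_text = '<html><head><meta charset="utf-8"></head><body><p>café</p></body></html>'
html_bytes = html_text.encode("utf-8")

# BeautifulSoup accepts bytes and works out the encoding itself ...
soup_from_bytes = BeautifulSoup(html_bytes, "html.parser")  # like response.content
# ... and it also accepts text that has already been decoded.
soup_from_text = BeautifulSoup(html_text, "html.parser")    # like response.text

print(soup_from_bytes.p.get_text())
print(soup_from_text.p.get_text())

# find() returns None when nothing matches, so test with `is`, not `==`.
missing = soup_from_bytes.find("div", class_="t-java")
print(missing is None)
```

Both forms parse without error here. The difference only matters when the encoding that `requests` guesses for `.text` is wrong, in which case passing the raw bytes lets BeautifulSoup use the encoding declared in the document.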