
I'm trying to scrape the links for the top 10 articles on Medium each day. By the looks of it, all the article links are in the class "postArticle-content", but when I run this code, I only get the top 3. Is there a way to get all 10?

from bs4 import BeautifulSoup
import requests

r = requests.get("https://medium.com/browse/726a53df8c8b")
data = r.text
soup = BeautifulSoup(data, "html.parser")  # specify a parser explicitly

# Each article teaser is wrapped in a div with class "postArticle-content"
articles = soup.find_all('div', attrs={'class': 'postArticle-content'})
for div in articles:
    for link in div.find_all('a'):
        print(link.get('href'))
stk1234

1 Answer


requests did give you the entire response.

That response contains only the first three articles. The site is designed to use JavaScript, running in the browser, to load additional content and add it to the page after the initial HTML arrives.

You need an entire web browser, with a JavaScript engine, to do what you are trying to do. The requests and Beautiful Soup libraries are not a web browser; they are merely an implementation of the HTTP protocol and an HTML parser, respectively.
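To illustrate the point: a parser only ever sees the markup the server actually returned, and any script tags in it are never executed. A minimal sketch, assuming Beautiful Soup is installed; the HTML snippet and hrefs here are made up, with the class name taken from the question:

```python
from bs4 import BeautifulSoup

# Pretend this is the HTML that requests returned: the server
# included only three articles; the rest would be added later
# by JavaScript, which never runs outside a browser.
html = """
<div class="postArticle-content"><a href="/article-1">One</a></div>
<div class="postArticle-content"><a href="/article-2">Two</a></div>
<div class="postArticle-content"><a href="/article-3">Three</a></div>
<script>/* code that would load articles 4-10 is never executed */</script>
"""

soup = BeautifulSoup(html, "html.parser")
links = [a.get("href")
         for div in soup.find_all("div", attrs={"class": "postArticle-content"})
         for a in div.find_all("a")]
print(links)  # only the three links present in the delivered HTML
```

Running this prints exactly the three hrefs that were in the delivered markup, which is why the original script tops out at three articles.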

dsh
  • That makes sense - would feedparser or selenium be the sort of library to take a look at? http://stackoverflow.com/questions/28499274/scraping-a-javascript-generated-page-using-python – stk1234 Feb 28 '17 at 13:47
  • If the site provides you with an Atom or RSS feed, then using that (with feedparser) would be suitable. Selenium would also be suitable, as it lets you easily automate the operation of a complete browser. – dsh Feb 28 '17 at 21:39