I am trying to extract information from this page. The page loads 10 items at a time, and I need to scroll to load all entries (for a total of 100). I am able to parse the HTML and get the information that I need for the first 10 entries, but I want to fully load all entries before parsing the HTML.
I am using Python, requests, and BeautifulSoup. The way I parse the page when it loads with the first 10 entries is as follows:
from bs4 import BeautifulSoup
import requests
s = requests.Session()
r = s.get('https://medium.com/top-100/december-2013')
page = BeautifulSoup(r.text, 'html.parser')  # explicit parser avoids bs4's "no parser specified" warning
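For reference, this is the kind of extraction I'm doing once the page is parsed. It's a self-contained sketch: the `post-item` class and the markup here are made up for illustration, since the real Medium page uses different (and messier) markup:

```python
from bs4 import BeautifulSoup

# Hypothetical markup standing in for the real page; the actual
# class names on medium.com differ.
html = """
<div class="post-item"><a href="/p/1">First post</a></div>
<div class="post-item"><a href="/p/2">Second post</a></div>
"""

page = BeautifulSoup(html, 'html.parser')

# Grab the link text out of each entry container.
titles = [div.a.get_text() for div in page.find_all('div', class_='post-item')]
print(titles)  # -> ['First post', 'Second post']
```

This works fine for whatever entries are present in the HTML; the problem is only that the initial response contains 10 of them.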
But this only loads the first 10 entries. So I looked at the page, found the AJAX request used to load the subsequent entries, and I get a response, but it's in a funky JSON format and I'd rather use the HTML parser instead of parsing JSON. Here's the code:
from bs4 import BeautifulSoup
import requests
import json
s = requests.Session()
url = 'https://medium.com/top-100/december-2013/load-more'
payload = {"count":100}
r = s.post(url, data=payload)
page = json.loads(r.text[16:])  # skip the non-JSON prefix Medium prepends, which throws json off
This gives me the data, but it's a very long and convoluted JSON blob; I would much rather load all the data on the page and simply parse the HTML. In addition, the rendered HTML provides more information than the JSON response (e.g. the author's name instead of an obscure userID). There was a similar question here, but no relevant answers. Ideally I want to make the POST call and then request the HTML and parse it, but I haven't been able to do that.