You have two options:
1) As others have said, use Selenium (or some other tool) to render the page first, then extract the content from the rendered HTML.
2) Find the data embedded within the <script> tags, which in my experience lets me avoid Selenium most of the time. The difficult part is locating it and then trimming the string into valid JSON that json.loads() can read.
I chose option 2:
import requests
import bs4
import json
res = requests.get('https://edition.cnn.com/politics')
res.raise_for_status()
soup = bs4.BeautifulSoup(res.text, 'html.parser')
tags = soup.find_all('script')
for tag in tags:
    # The article data sits in the script that defines the global CNN object
    if 'var CNN = CNN ||' in tag.text:
        jsonStr = tag.text
        # Keep what follows 'siblings:', cut at the first ']', then re-close the list and the object
        jsonStr = jsonStr.split('siblings:')[-1].strip()
        jsonStr = jsonStr.split(']',1)[0] + ']}'
        jsonData = json.loads(jsonStr)
for article in jsonData['articleList']:
    headline = article['headline']
    link = 'https://edition.cnn.com' + article['uri']
    print('Headline: %s\nLink: %s\n\n' % (headline, link))
Output:
Headline: Trump ratchets up anti-impeachment rhetoric as troubles mount
Link: https://edition.cnn.com/2019/10/02/politics/president-donald-trump-impeachment-democrats-pompeo/index.html
Headline: Here's what happened in another wild day of the Trump-Ukraine scandal
Link: https://edition.cnn.com/2019/10/01/politics/ukraine-guide-rudy-giuliani-trump-whistleblower/index.html
Headline: All the President's men: Trump's allies part of a tangled web
Link: https://edition.cnn.com/2019/10/01/politics/trump-act-alone-ukraine-call/index.html
Headline: State Department inspector general requests briefing on Ukraine with congressional staff
Link: https://edition.cnn.com/2019/10/01/politics/deposition-delayed-impeachment-investigation/index.html
Headline: Senior GOP senator rebukes Trump, says whistleblower 'ought to be heard out'
Link: https://edition.cnn.com/2019/10/01/politics/grassley-whistleblower-statement/index.html
Headline: How Lindsey Graham's support for Trump — a man he once called a 'jackass' — has evolved
Link: https://edition.cnn.com/2019/10/01/politics/lindsey-graham-defends-trump-whistleblower/index.html
Headline: Federal judge blocks California law requiring Trump to release tax returns to appear on ballot
Link: https://edition.cnn.com/2019/10/01/politics/california-law-trump-tax-returns-blocked/index.html
...
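The two split() calls are easier to follow on a small string. The sketch below runs them on a made-up, heavily shortened stand-in for the script text (the real one is far longer and its exact layout is an assumption here), printing the trimmed string so you can see why ']}' has to be appended.

```python
import json

# Hypothetical, heavily shortened stand-in for the text of the matching <script> tag.
script_text = (
    'var CNN = CNN || {}; CNN.contentModel = {sectionName: "politics", '
    'siblings: {"articleList": [{"headline": "Example headline", '
    '"uri": "/2019/10/02/politics/example/index.html"}], "other": 1}};'
)

# Step 1: keep only what follows the last 'siblings:' marker.
json_str = script_text.split('siblings:')[-1].strip()

# Step 2: cut at the first ']' (the end of the article list), then append ']}'
# to close the list and the enclosing object again.
json_str = json_str.split(']', 1)[0] + ']}'

data = json.loads(json_str)
print(json_str)
print(data['articleList'][0]['headline'])
```

Note that cutting at the first ']' is fragile: a headline that itself contains ']' would truncate the string too early and json.loads() would fail.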
HOW DID I KNOW TO SEARCH 'var CNN = CNN ||'?
It just takes a little investigating of the HTML. I could simply use View Source, find a headline in it, and locate its tag. What I usually do instead is write little ad-hoc scripts, which I throw away later, to narrow down the search:
1) I get every tag in the html
import requests
import bs4
import json
res = requests.get('https://edition.cnn.com/politics')
res.raise_for_status()
soup = bs4.BeautifulSoup(res.text, 'html.parser')
# Get every tag in html
tags = soup.find_all()
2) Go through every tag to see if a headline is within the text.
The headlines change often, so I just go to the URL in my browser and pick a substring from a main headline. If I go to https://edition.cnn.com/politics right now, one of the headlines reads "Kurt Volker: Diplomat never 'fully on the Trump train' set to appear as first witness in Ukraine probe". Then I check whether a substring of that is present anywhere. If it is, I can investigate further; if not, I'm out of luck and need to see if I can get the data some other way.
for tag in tags:
    if "Kurt Volker: Diplomat never 'fully on the Trump train'" in tag.text:
        tag_name = tag.name
        print('Possibly found article in %s tag' % tag_name)
And the readout:
Possibly found article in html tag
Possibly found article in head tag
Possibly found article in link tag
Possibly found article in link tag
Possibly found article in link tag
Possibly found article in link tag
Possibly found article in link tag
Possibly found article in link tag
Possibly found article in script tag
3) Aha, it is present. Knowing how HTML is structured, the html tag is the whole document and each subsequent tag listed is a descendant. My experience tells me that the leaf tag where I'll most likely find the data is the script tag, so I will now search through the script tags.
scripts = soup.find_all('script')
print (len(scripts))
4) I see there are 28 <script> tags, so which one do I want to look at?
for idx, script in enumerate(scripts):
    if "Kurt Volker: Diplomat never 'fully on the Trump train'" in script.text:
        print('Headline found:\nIndex position %s' % idx)
5) It says it's in index position 1, so let's grab that:
scriptStr = scripts[1].text
print (scriptStr)
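If you end up doing this kind of search often, the "which script holds my text?" check can be folded into one small helper. The sketch below is only an illustration: it uses the standard library's html.parser instead of BeautifulSoup so it runs offline, and the names ScriptCollector, scripts_containing and sample_html are made up for this example. With BeautifulSoup, the equivalent is a list comprehension over soup.find_all('script').

```python
from html.parser import HTMLParser


class ScriptCollector(HTMLParser):
    """Collect the text content of every <script> element, in document order."""

    def __init__(self):
        super().__init__()
        self.scripts = []        # one string per <script> element
        self._in_script = False

    def handle_starttag(self, tag, attrs):
        if tag == 'script':
            self._in_script = True
            self.scripts.append('')

    def handle_endtag(self, tag):
        if tag == 'script':
            self._in_script = False

    def handle_data(self, data):
        # Script content may arrive in more than one chunk, so accumulate it.
        if self._in_script:
            self.scripts[-1] += data


def scripts_containing(html, needle):
    """Return the index positions of the <script> elements whose text contains needle."""
    collector = ScriptCollector()
    collector.feed(html)
    collector.close()
    return [idx for idx, text in enumerate(collector.scripts) if needle in text]


# Hypothetical, heavily simplified page used only for illustration.
sample_html = """
<html><head>
<script>var analytics = {};</script>
<script>var CNN = CNN || {}; CNN.contentModel = {"headline": "Example headline"};</script>
</head><body><p>Example headline</p></body></html>
"""

matches = scripts_containing(sample_html, 'Example headline')
print(matches)
```

On a real page you would pass res.text and a substring of a current headline instead of the sample values.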
6) Now I see that what I really need to search for is the <script> tag that starts with 'var CNN' in its text, since that will likely not change while the headlines will. So I can go back and, instead of looking for the headline substring, have it find 'var CNN'.
...
tags = soup.find_all('script')
for tag in tags:
    if 'var CNN = CNN ||' in tag.text:
        ...
...
7) The last part (which I won't get into) is to trim off all the excess substrings to leave the valid JSON that contains all the data. Once you are left with the valid JSON substring, you can use json.loads() to read it in, then iterate through the dictionary/list that Python stores it in.
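One way to cut down on the hand-trimming, as a rough sketch rather than what I did above: json.JSONDecoder().raw_decode() parses a single JSON value starting at a given position and ignores whatever follows it, so only the start of the JSON has to be located. The script text and the helper name decode_json_after are made up for illustration, and this assumes the part after the marker is strictly valid JSON (quoted keys, no trailing commas).

```python
import json

# Hypothetical, heavily shortened stand-in for the <script> text.
script_text = (
    'var CNN = CNN || {}; CNN.contentModel = {sectionName: "politics", '
    'siblings: {"articleList": [{"headline": "Budget talks stall [updated]", '
    '"uri": "/2019/10/02/politics/example/index.html"}]}, layout: "list"};'
)


def decode_json_after(text, marker):
    """Decode the first JSON object that starts after marker, ignoring what follows it."""
    # raw_decode does not skip leading whitespace, so point it at the '{' itself.
    start = text.index('{', text.index(marker) + len(marker))
    obj, _end = json.JSONDecoder().raw_decode(text, start)
    return obj


siblings = decode_json_after(script_text, 'siblings:')
for article in siblings['articleList']:
    print(article['headline'], article['uri'])
```

Because the decoder tracks strings and nesting itself, a ']' inside a headline does not cut the data short the way splitting on ']' would.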