I am trying to check the status of an account on Twitter.com. The site does not use stable container names (they appear to be dynamically generated), so I am matching on text strings instead. Inspired by this question, I expected the following code to work, but it returns an empty list:

import requests
from bs4 import BeautifulSoup
page = requests.Session().get('https://twitter.com/MikeEPeinovich')
page = page.content
soup = BeautifulSoup(page, "lxml")
print(soup.find_all(text="Account suspended"))

...and here's a variation using a different request library and HTML parser (same end result though):

import urllib2
from bs4 import BeautifulSoup

page = urllib2.urlopen('https://twitter.com/MikeEPeinovich')
soup = BeautifulSoup(page, "html.parser")
print(soup.find_all(text="Account suspended"))

Any advice on what I'm doing wrong? Thanks!

UPDATE

It was rightly pointed out below that I needed something like Selenium to mimic browser behaviour and capture the fully loaded, dynamic page, so I've integrated Selenium and Mozilla's geckodriver into the script. On inspecting the soup object, though, I'm clearly still not grabbing everything. This is the script I'm using now:

# With Selenium
from bs4 import BeautifulSoup
from selenium.webdriver.firefox.options import Options as FirefoxOptions
from selenium import webdriver

url = "https://twitter.com/MikeEPeinovich"

options = FirefoxOptions()
options.add_argument("--headless")
browser = webdriver.Firefox(options=options)
browser.get(url)
html = browser.page_source
soup = BeautifulSoup(html, 'lxml')
print(soup.find_all(text="Account suspended"))
jim_jones
  • If you print `page` you get a whole bunch of `script` tags, indicating that the page loads the content (including the text you are searching for) using JavaScript. – Dec 07 '20 at 14:19

2 Answers


It's because the string really isn't there. The requests library only fetches the initial HTML of the page, which often lacks some of the content (the rest is loaded later by JavaScript; BeautifulSoup just parses whatever HTML it is given). If you go to the page mentioned in the question and press Ctrl+U, you will not find the string "Account suspended" in the source. And that is the same HTML that the requests library sees.
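The difference can be shown with two toy snippets. Both are made up for illustration; the real Twitter shell page is far larger, but the principle is the same:

```python
from bs4 import BeautifulSoup

# What requests receives: a shell page whose content is filled in by
# JavaScript after load. (Snippet is illustrative, not real Twitter HTML.)
initial_html = """
<html><body>
  <div id="react-root"></div>
  <script src="main.js"></script>
</body></html>
"""

# What the browser shows after the scripts have run. (Also illustrative.)
rendered_html = """
<html><body>
  <div id="react-root"><span>Account suspended</span></div>
</body></html>
"""

print(BeautifulSoup(initial_html, "html.parser").find_all(text="Account suspended"))
# -> []
print(BeautifulSoup(rendered_html, "html.parser").find_all(text="Account suspended"))
# -> ['Account suspended']
```

The search itself is fine; it is the input HTML that differs.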

As a solution you can use, for example, Selenium to load the webpage the way a browser would. Or you can open the Network tab in your browser's developer tools to see what requests Twitter makes in the background. I checked, and the account info was retrieved by one of those requests, but I was not able to replicate the request in Postman (which is not surprising; a site as big as Twitter must have good security).

Update:

See for example this question: Wait page to load before getting data with requests.get in python 3
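The linked approach can be sketched as follows. This assumes headless Firefox with geckodriver on the PATH; the 15-second timeout and the helper names (`is_suspended`, `check_account`) are my own choices, not anything Twitter-specific:

```python
from bs4 import BeautifulSoup

def is_suspended(html):
    """Check the given HTML for the suspension notice."""
    soup = BeautifulSoup(html, "html.parser")
    return bool(soup.find_all(text="Account suspended"))

def check_account(url, timeout=15):
    """Load the page in headless Firefox and wait for the dynamic content."""
    # Imported here so is_suspended() stays usable without Selenium installed.
    from selenium import webdriver
    from selenium.common.exceptions import TimeoutException
    from selenium.webdriver.firefox.options import Options as FirefoxOptions
    from selenium.webdriver.support.ui import WebDriverWait

    options = FirefoxOptions()
    options.add_argument("--headless")
    browser = webdriver.Firefox(options=options)
    try:
        browser.get(url)
        # Poll page_source until the text appears, instead of reading it
        # immediately after get() returns.
        WebDriverWait(browser, timeout).until(
            lambda driver: is_suspended(driver.page_source)
        )
        return True
    except TimeoutException:
        return False
    finally:
        browser.quit()
```

`check_account("https://twitter.com/MikeEPeinovich")` would then return True once the notice renders, or False if it never shows up within the timeout.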

druskacik
    Thanks for the input. I've integrated Selenium and Firefox's headless browser via geckodriver into my script, but when I inspect the `soup` object I get back, that string is still not there, so I suspect I need to play around with the loading-time parameters or something... – jim_jones Dec 07 '20 at 15:03

The page is generated by JavaScript.

So you could scrape the Ajax API instead (using the correct headers and parameters), like:

import requests

headers = {
    'x-csrf-token': '11a1d4eb65d6b52fb22ef8c0377013bf',
    'authorization': 'Bearer AAAAAAAAAAAAAAAAAAAAANRILgAAAAAAnNwIzUejRCOuH5E6I8xnZz4puTs%3D1Zv7ttfk8LF81IUq16cHjhLTvJu4FA33AGWWjCpTnA',
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/85.0.4183.83 Safari/537.36',
    'x-guest-token': '1335956221107572737',
    'cookie': 'personalization_id="v1_/4NldbdRSml+BviPBqfJVg=="; guest_id=v1%3A160735174410977274; ct0=11a1d4eb65d6b52fb22ef8c0377013bf; _twitter_sess=BAh7CSIKZmxhc2hJQzonQWN0aW9uQ29udHJvbGxlcjo6Rmxhc2g6OkZsYXNo%250ASGFzaHsABjoKQHVzZWR7ADoPY3JlYXRlZF9hdGwrCLdgoT12AToMY3NyZl9p%250AZCIlN2I4Y2YzMThjODBkZmQ5NjkzMGQyN2UyNTZmODAxMGQ6B2lkIiU1OWYw%250ANjc5OWI5OGMyYmViOGNlMWE0ZWNkNzdiMjQyYw%253D%253D--ea9af5c4c148aee6204c39ddd96cc43125ee9893; gt=1335956221107572737',
}

username = "MikeEPeinovich"

params = (
    ('variables', '{"screen_name":"MikeEPeinovich","withHighlightedLabel":true}'),
)

response = requests.get('https://api.twitter.com/graphql/esn6mjj-y68fNAj45x5IYA/UserByScreenName', headers=headers, params=params)
print(response.json()["errors"][0]["message"])

To get the error message:

Authorization: User has been suspended. (63)
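If you want a status value rather than a raw message, you can inspect the JSON defensively instead of indexing straight into `["errors"][0]["message"]` (which raises a `KeyError` for a non-suspended account). The payload shapes below are illustrative, based only on the error message shown above; the `"data"` branch is an assumption about what a live account returns:

```python
def account_status(payload):
    """Classify a UserByScreenName JSON payload (illustrative shape)."""
    for error in payload.get("errors", []):
        if "suspended" in error.get("message", "").lower():
            return "suspended"
    if payload.get("data"):
        return "active"
    return "unknown"

# Error shape taken from the message shown above.
print(account_status(
    {"errors": [{"message": "Authorization: User has been suspended. (63)"}]}
))
# -> suspended
```

With the request above, this would be `account_status(response.json())`.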
jizhihaoSAMA
    Elegant. However, I suspect this may stop working after some time, when the `cookie` and `authorization` headers expire. I had this problem a while ago and solved it like this: first I made an ordinary request with `requests`, from which I got the `cookie` value. Then I inserted the `cookie` value into the headers and it worked. – druskacik Dec 07 '20 at 15:10
    @druskacik This would only work for a while, because the cookie expires. To obtain those cookies with the `requests` module, you may need `requests.Session` and to send requests the way a browser does. – jizhihaoSAMA Dec 07 '20 at 15:22