87

I am trying to use the Requests framework with Python (http://docs.python-requests.org/en/latest/), but the page I am trying to get uses JavaScript to fetch the info that I want.

I have tried to search the web for a solution, but because I am searching with the keyword "javascript", most of the results I get are about how to scrape with the JavaScript language.

Is there any way to use the Requests framework with pages that use JavaScript?

biw
  • 3,000
  • 4
  • 23
  • 40

7 Answers

107

Good news: there is now a requests module that supports JavaScript: https://pypi.org/project/requests-html/

from requests_html import HTMLSession

session = HTMLSession()
r = session.get('http://www.yourjspage.com')
r.html.render()  # this call executes the JS in the page

As a bonus this wraps BeautifulSoup, I think, so you can do things like

r.html.find('#myElementID').text

which returns the content of the HTML element as you'd expect.

marvb
  • 1,209
  • 2
  • 8
  • 3
  • 4
    Shouldn't it be `r.html.find('#myElementID').text`? And also `r = session.get('http://www.yourjspage.com')`? – phrogg Jan 30 '19 at 10:23
  • 4
    After fixing the issues that Phil pointed out, I still got "RuntimeError: Cannot use HTMLSession within an existing event loop. Use AsyncHTMLSession instead." – Joshua Stafford May 02 '19 at 22:52
  • This exists only for python 3 as far as I'm concerned. Is there anything Python 2.7 users can do here? – KubaFYI Aug 27 '19 at 14:11
  • 28
    @KubaFYI Yes, they can start moving things over to python3 – Alejandro Braun Oct 27 '19 at 19:47
  • 6
    @HuckIt To solve this problem, you'll import `AsyncHTMLSession` instead of `HTMLSession` and the render will be called with `await session.get(url).result().arender()`. I just got this problem and this is how I solved it. – Vanessa Feb 27 '20 at 04:12
  • 3
    As it's written in its doc https://requests.readthedocs.io/projects/requests-html/en/latest/#javascript-support requests_html uses Chromium in the background. So it's Chromium browser controlled by a requests-like wrapper. – Sinan Cetinkaya Sep 04 '20 at 21:26
  • Unfortunately, requests-html is python 3.6 only, and I'm on 3.8 – Eric Nelson Feb 16 '21 at 02:24
  • r.html.render() Is there any way to execute it in chrome browser? I am getting "this browser is not supported. Please reconnect using the Chrome browser.". Which I have set for using browser other than chrome. – Akash Patel Apr 08 '21 at 06:02
  • @EricNelson It still works for me in 3.9. I don't believe there's any real issues with using a module built for 3.6. – Corsaka Sep 06 '21 at 13:58
  • 1
    From the documentation: `Note, the first time you ever run the render() method, it will download Chromium into your home directory. This only happens once.` – asmaier Feb 06 '22 at 14:06
49

You are going to have to make the same request (using the Requests library) that the javascript is making. You can use any number of tools (including those built into Chrome and Firefox) to inspect the http request that is coming from javascript and simply make this request yourself from Python.
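As a sketch of this approach (the endpoint, parameters, and headers below are made up for illustration), you can rebuild the XHR you saw in the browser's network tab with Requests. Preparing the request without sending it lets you check that the URL and headers match what the browser sent before you fire it off:

```python
import requests

# Hypothetical JSON endpoint discovered in the browser's Network tab
url = "https://example.com/api/search"
params = {"q": "python", "page": 1}
headers = {"Accept": "application/json"}

# Build the request without sending it, so we can inspect it first
prepared = requests.Request("GET", url, params=params, headers=headers).prepare()
print(prepared.url)  # https://example.com/api/search?q=python&page=1

# Once it matches the request the browser made, send it:
# response = requests.Session().send(prepared)
# data = response.json()
```

The payoff of this approach is that the endpoint usually returns clean JSON, so there is no HTML parsing at all.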

sberry
  • 128,281
  • 18
  • 138
  • 165
  • 4
    So there is no way to have requests use javascript. – biw Oct 15 '14 at 22:57
  • 16
    No, Requests is an http library. It cannot run javascript. – sberry Oct 16 '14 at 20:48
  • 2
    I used Chrome tools to debug the website and look for what the Javascript was calling. You can see the results of what I created at https://github.com/719Ben/myCUinfo-API – biw Jan 10 '18 at 20:50
  • So far this is the best. You can also get nice JSON so its easier to get data – cikatomo Sep 11 '20 at 08:12
36

While Selenium might seem tempting and useful, it has one main problem that can't be fixed: performance. Because it computes every single thing a browser does, it needs a lot more power. Even PhantomJS does not compete with a simple request. I recommend that you only use Selenium when you really need to click buttons. If you only need JavaScript, I recommend PyQt (check https://www.youtube.com/watch?v=FSH77vnOGqU to learn it).

However, if you want to use Selenium, I recommend Chrome over PhantomJS. Many users have problems with PhantomJS where a website simply does not work in Phantom. Chrome can be headless (non-graphical) too!

First, make sure you have installed ChromeDriver, which Selenium depends on for using Google Chrome.

Then, make sure you have Google Chrome of version 60 or higher by checking it in the URL chrome://settings/help

Now, all you need to do is the following code:

from selenium.webdriver.chrome.options import Options
from selenium import webdriver

chrome_options = Options()
chrome_options.add_argument("--headless")

driver = webdriver.Chrome(options=chrome_options)  # "options=" replaces the deprecated "chrome_options=" keyword

If you do not know how to use Selenium, here is a quick overview:

driver.get("https://www.google.com") #Browser goes to google.com

Finding elements: Use either the ELEMENTS or ELEMENT method. Examples:

driver.find_element_by_css_selector("div.logo-subtext") #Find your country in Google. (singular)
  • driver.find_element(s)_by_css_selector(css_selector) # Every element that matches this CSS selector
  • driver.find_element(s)_by_class_name(class_name) # Every element with the following class
  • driver.find_element(s)_by_id(id) # Every element with the following ID
  • driver.find_element(s)_by_link_text(link_text) # Every link element with the full link text
  • driver.find_element(s)_by_partial_link_text(partial_link_text) # Every link element with partial link text
  • driver.find_element(s)_by_name(name) # Every element where name=argument
  • driver.find_element(s)_by_tag_name(tag_name) # Every element with the tag name argument

Ok! I found an element (or elements list). But what do I do now?

Here are the methods you can do on an element elem:

  • elem.tag_name # The element's tag name, e.g. "button" for a button element.
  • elem.get_attribute("id") # Returns the ID of an element.
  • elem.text # The inner text of an element.
  • elem.clear() # Clears a text input.
  • elem.is_displayed() # True for visible elements, False for invisible elements.
  • elem.is_enabled() # True for an enabled input, False otherwise.
  • elem.is_selected() # Is this radio button or checkbox element selected?
  • elem.location # A dictionary representing the X and Y location of an element on the screen.
  • elem.click() # Click elem.
  • elem.send_keys("thelegend27") # Type thelegend27 into elem (useful for text inputs)
  • elem.submit() # Submit the form in which elem takes part.

Special commands:

  • driver.back() # Click the Back button.
  • driver.forward() # Click the Forward button.
  • driver.refresh() # Refresh the page.
  • driver.quit() # Close the browser including all the tabs.
  • foo = driver.execute_script("return 'hello';") # Execute javascript (COULD TAKE RETURN VALUES!)
Lil Taco
  • 515
  • 5
  • 9
0

Using Selenium or JavaScript-enabled requests is slow. It is more efficient to find out which cookie is generated after the website's JavaScript check runs in the browser, and then reuse that cookie for each of your requests.

In one example this worked with the following cookie:

The cookie generated after the JavaScript check in this example is "cf_clearance". So simply create a session, then update the cookie and headers like so:

import requests

s = requests.Session()
s.cookies["cf_clearance"] = "cb4c883efc59d0e990caf7508902591f4569e7bf-1617321078-0-150"
s.headers.update({
    "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/89.0.4389.90 Safari/537.36"
})
s.get(url)

and you are good to go, with no need for a JavaScript solution such as Selenium. This is much faster and more efficient; you just have to get the cookie once after opening the browser.
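To avoid retyping cookie values by hand, you can paste the entire Cookie header copied from the browser's DevTools and parse it with the standard library (the header value below is a made-up example; only the parsing technique is the point):

```python
from http.cookies import SimpleCookie

# The raw Cookie header copied from the browser's DevTools (example value)
raw = "cf_clearance=cb4c883efc59d0e990caf7508902591f4569e7bf-1617321078-0-150; session_id=abc123"

# Parse the header into name/value pairs
jar = SimpleCookie()
jar.load(raw)
cookies = {name: morsel.value for name, morsel in jar.items()}

print(sorted(cookies))  # ['cf_clearance', 'session_id']
```

You can then pass `cookies=cookies` to `requests.get`, or copy the pairs into a `requests.Session` as shown above.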

MML
  • 3
  • 2
Yousuf
  • 13
  • 5
0

One way to do that is to invoke your request using Selenium. Let's install the dependencies using pip or pip3:

pip install selenium

etc.

If you run the script with python3, use instead:

pip3 install selenium

(...)

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from webdriver_manager.chrome import ChromeDriverManager

driver = webdriver.Chrome(ChromeDriverManager().install())
url = 'http://myurl.com'

# Wait until the page is ready:
element = WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.CSS_SELECTOR, "div.some_placeholder")))
print(element.text)  # <-- Here it is! I got what I wanted :)
0

Maybe someone will benefit from my experience. It was tricky for me to collect information from the website of the Pyaterochka store. The first page was returned as HTML, but the subsequent ones were fetched via JavaScript.

from requests_html import HTMLSession

session = HTMLSession()

def fetch(url, params):
    headers = params['headers']
    return session.get(url, headers=headers)

current_page = 1

req = fetch(
    f"https://5ka.ru/api/v2/special_offers/?records_per_page=15&page={current_page}&store=31Z6&ordering=&price_promo__gte=&price_promo__lte=&categories=&search=",
    {
        "headers": {
            "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:109.0) Gecko/20100101 Firefox/111.0",
            "Accept": "application/json, text/plain, */*",
            "Accept-Language": "ru-RU,ru;q=0.8,en-US;q=0.5,en;q=0.3",
        },
    })

for pp in req.json()['results']:
    print(f'\nname = {pp["name"]}')
    print(f'price = {pp["current_prices"]["price_promo__min"]}')
    print(f'url = {pp["img_link"]}')
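The extraction loop above can be tested offline against a sample of the JSON the API returns (the payload below is a trimmed, made-up stand-in for the real response, so field names match the code but the data is invented):

```python
# A made-up sample shaped like one page of the API's JSON response
sample_page = {
    "results": [
        {"name": "Milk", "current_prices": {"price_promo__min": "49.99"}, "img_link": "https://example.com/milk.jpg"},
        {"name": "Bread", "current_prices": {"price_promo__min": "29.99"}, "img_link": "https://example.com/bread.jpg"},
    ]
}

def extract_offers(page):
    """Pull (name, promo price, image URL) tuples out of one page of results."""
    return [
        (pp["name"], pp["current_prices"]["price_promo__min"], pp["img_link"])
        for pp in page["results"]
    ]

for name, price, img in extract_offers(sample_page):
    print(name, price, img)
```

Keeping the extraction in a small function like this makes it easy to loop `current_page` over many pages while reusing the same parsing code.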
Roman
  • 1
-1

Is this just a wrapper around pyppeteer or something? :( I thought it was something different.

    @property
    async def browser(self):
        if not hasattr(self, "_browser"):
            self._browser = await pyppeteer.launch(ignoreHTTPSErrors=not(self.verify), headless=True, args=self.__browser_args)

        return self._browser