
I am using the requests module to get the content of a webpage, but when I look at the response I see that it does not contain the full content of the page.

Here is my code:

import requests
from bs4 import BeautifulSoup

url = "https://shop.nordstrom.com/c/womens-dresses-shop?origin=topnav&cm_sp=Top%20Navigation-_-Women-_-Dresses&offset=11&page=3&top=72"
page = requests.get(url)

soup = BeautifulSoup(page.content, 'html.parser')
print(soup.prettify())

Also, in the Chrome browser, if I view the page source I do not see the full content.

Is there a way to get the full content of the example page that I have provided?
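For context, `requests` only returns the HTML the server sends in the initial response; anything JavaScript builds afterwards in the browser is missing. A minimal offline illustration of that gap (the HTML snippet here is made up, not taken from the Nordstrom page):

```python
from bs4 import BeautifulSoup

# HTML as a server might send it: an empty container plus a script
# that would populate it only when run in a real browser.
served_html = """
<html><body>
  <div id="product-list"></div>
  <script>/* the browser would insert <article> items here */</script>
</body></html>
"""

soup = BeautifulSoup(served_html, "html.parser")
# BeautifulSoup never executes the script, so the container stays empty.
products = soup.select("#product-list article")
print(len(products))  # 0
```

This is exactly what happens with `requests.get(url)` on a JavaScript-heavy page: the parser sees the empty shell, not the rendered products.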

TJ1
    "Also on the chrome web-browser if I look at the page source I do not see the full content." Why do you blame `requests` then? – Elis Byberi Dec 09 '17 at 16:34
  • The page is probably generated dynamically by javascript running in the browser. This is very common, and there are many questions here on stackoverflow that address this exact issue. – larsks Dec 09 '17 at 16:40
  • It's probably like @larsks said. Can you give us more details: what is the missing part that you can't see when you view the source in the browser? – Ahmad Nourallah Dec 09 '17 at 16:46
  • @ElisByberi I do not blame `requests`, I am just saying I am using requests. – TJ1 Dec 09 '17 at 17:06

2 Answers


The page is rendered with JavaScript, which makes additional requests to fetch more data; `requests` only gets the initial HTML. You can fetch the fully rendered page with Selenium.

from bs4 import BeautifulSoup
from selenium import webdriver

# Selenium drives a real browser, so the page's JavaScript actually runs
driver = webdriver.Chrome()
url = "https://shop.nordstrom.com/c/womens-dresses-shop?origin=topnav&cm_sp=Top%20Navigation-_-Women-_-Dresses&offset=11&page=3&top=72"
driver.get(url)
# page_source contains the JavaScript-rendered DOM, not just the initial HTML
soup = BeautifulSoup(driver.page_source, 'html.parser')
driver.quit()
print(soup.prettify())

For other solutions see my answer to Scraping Google Finance (BeautifulSoup)
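A browser-free alternative worth knowing about: pages like this usually load their products from a JSON endpoint, which you can spot in the DevTools Network tab (XHR) and call directly with `requests`. A sketch, where the endpoint URL and the payload shape are hypothetical, not the real Nordstrom API:

```python
import json

# Hypothetical endpoint: find the real one in DevTools -> Network -> XHR.
API_URL = "https://shop.nordstrom.com/api/products"  # placeholder, not a real path

def extract_names(payload: str) -> list:
    """Pull product names out of a JSON payload shaped like a typical XHR response."""
    data = json.loads(payload)
    return [item["name"] for item in data.get("products", [])]

# Offline demonstration with a made-up payload:
sample = '{"products": [{"name": "Wrap Dress"}, {"name": "Maxi Dress"}]}'
print(extract_names(sample))  # ['Wrap Dress', 'Maxi Dress']

# With the real endpoint you would do something like (not run here):
# import requests
# resp = requests.get(API_URL, headers={"User-Agent": "Mozilla/5.0"})
# names = extract_names(resp.text)
```

When such an endpoint exists, hitting it directly is much faster than rendering the whole page in a browser.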

Dan-Dev
  • Thanks, when I try to run your code I get this error: FileNotFoundError: [Errno 2] No such file or directory: 'chromedriver' – TJ1 Dec 09 '17 at 17:04
  • 1
    You need to download ChromeDriver and put it in your path https://sites.google.com/a/chromium.org/chromedriver/ – Dan-Dev Dec 09 '17 at 17:06
  • You can use a headless version of Chrome, "Chrome Canary", if you are on Windows. – Dan-Dev Dec 09 '17 at 17:09
  • I am on a Mac. I copied chromedriver to the same directory as my Python source, but I still get the error. – TJ1 Dec 09 '17 at 17:11
  • It's been a long time since I used a Mac. On Linux you put it in /usr/local/bin/; is it the same on a Mac? – Dan-Dev Dec 09 '17 at 17:12
  • There is an OS X build of Chrome Canary too; see https://www.google.co.uk/chrome/browser/canary.html – Dan-Dev Dec 09 '17 at 17:22
  • Thanks I noticed that I need to add the path to chromedriver in driver = webdriver.Chrome(). Now it works fine. I accepted your answer :) – TJ1 Dec 09 '17 at 17:22
  • I still didn't end up with any extra info added to the HTML file. – Stellan May 27 '20 at 11:29
  • Is there any alternative to the web driver? The web driver makes the scraping process slow, and I want to get the data in a faster way. – sheetal Nov 28 '20 at 07:53
  • Google `web scraping with PyQT5`, `requests-html` (not requests), or `scrapy splash`. These may be faster, but rendering JavaScript always takes more time than just downloading a page. – Dan-Dev Nov 28 '20 at 12:21
  • Each time I execute this snippet the browser runs. Is there a way to do this task without running the browser? – pentanol Feb 13 '21 at 11:50
  • Using Selenium you can, on some operating systems, run it headless. The browser will still run, but it won't open a window. Otherwise, you can look at alternatives like requests-html or PyQT5. A Google search for "Selenium headless", "requests-html", or "web scraping with PyQT5" should yield results. – Dan-Dev Feb 13 '21 at 11:56
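The two recurring problems in these comments are chromedriver not being found and the browser window opening. A quick sketch of how to check the first and configure the second (the headless snippet assumes Selenium 4-style `ChromeOptions` and is left commented out so it runs without a browser installed):

```python
import shutil

# webdriver.Chrome() looks for the chromedriver binary on PATH,
# which is why dropping it next to your script is not enough.
path = shutil.which("chromedriver")
print(path or "chromedriver not found on PATH "
              "(put it in e.g. /usr/local/bin/ or pass its location explicitly)")

# Headless mode: Chrome runs but never opens a window. With Selenium installed:
# from selenium import webdriver
# options = webdriver.ChromeOptions()
# options.add_argument("--headless=new")
# driver = webdriver.Chrome(options=options)
```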

A request is different from what the browser ultimately displays, and viewing the page source doesn't give you access to everything that produces the page, including database queries and other back-end logic. Either your question is not clear enough, or you've misunderstood how web browsing works.