
I am learning web scraping with Python, using the BeautifulSoup and requests libraries. When I try to pull data from a web page, for example this Sears product URL - https://www.sears.com/tradesman-talg1670-70-inch-economy-line-aluminum-gull/p-00937054000P?plpSellerId=Sears&prdNo=1&blockNo=1&blockType=G1 , I do not get the complete page source. I need the product title, price, specifications, etc.

While checking the browser's console I found a URL that returns all the product details in JSON format, but I am still unable to pull this JSON data. Here is the JSON URL - https://www.sears.com/content/pdp/config/products/v1/products/04403935070P?site=sears

Below is the code I use to fetch the page source:

from bs4 import BeautifulSoup
import requests
import re
import json

s = requests.session()  # start a requests session
page = s.get("https://www.sears.com/tradesman-talg1670-70-inch-economy-line-aluminum-gull/p-00937054000P?plpSellerId=Sears&prdNo=1&blockNo=1&blockType=G1")  # get the page
soup = BeautifulSoup(page.content)

# print(soup.encode("utf-8"))
print(soup)

Please check this code and suggest a better solution. Thanks in advance.

Andersson
  • It may be that JavaScript isn't executed. You can also try your luck with https://html.python-requests.org – FMCorz Oct 01 '18 at 10:20
  • You need to use Selenium, which simulates a browser, because this page has dynamically rendered JavaScript. You can use any browser or a headless browser like PhantomJS. – GraphicalDot Oct 01 '18 at 10:23
  • Had a look at the website and it looks like the data you want is being displayed by a JavaScript call, so this post might help: https://stackoverflow.com/q/26393231/9742036 – Andrew McDowell Oct 01 '18 at 10:25
  • Why did this question get so many upvotes? This is obviously the common issue where the OP cannot scrape dynamically rendered data with a GET request... – Andersson Oct 01 '18 at 10:27
  • @saurav verma I had already tried Selenium and PhantomJS but could not get the expected results – vasdev chandrakar Oct 01 '18 at 10:27
  • Because this website uses JavaScript to load data, you cannot scrape all of it – KC. Oct 01 '18 at 10:27
  • Using your code, I can see the price in `soup`... what makes you think you can't? Have you tried specifying a parser in your call to `BeautifulSoup`? – Dan Oct 01 '18 at 10:28
  • @kcorlidy I have a URL with the JSON data, but I am still not able to pull any of the data. Here is the URL - https://www.sears.com/content/pdp/config/products/v1/products/04403935070P?site=sears – vasdev chandrakar Oct 01 '18 at 10:30
  • @vasdevchandrakar, the data comes from JSON, but you're requesting the HTML source... What is the point? If you need the data, make a request to the JSON URL, not to the HTML URL – Andersson Oct 01 '18 at 10:33
  • @dan, yes, we tried two parsers, lxml and html5lib, but did not get the complete source code, which should include the price, product title, and so on. – vasdev chandrakar Oct 01 '18 at 10:33
  • @Andersson if you have any reference on this, please let me know – vasdev chandrakar Oct 01 '18 at 10:35
  • Google how to use Selenium, and this question is over. BTW, I really don't know why this question has so many upvotes – KC. Oct 01 '18 at 10:38
  • @Andersson: why did you reopen this? This is a general 'why won't a website give me the info I can see in the browser' question that would generally be closed as too broad or missing a MCVE. By duping it at least the OP has a path towards solving the issue themselves. – Martijn Pieters Oct 01 '18 at 23:18
  • @MartijnPieters, I do agree that your solution gives a lot of tips for solving the "missing data" issue, but note that the OP doesn't get a specific HTTP status, knows how to use the Network tab of the browser dev console, and knows where the required data comes from. The problem is that the OP thinks the data from all the requests the browser sends while rendering the page should somehow be combined in the page source... and your solution contains no obvious clarification for this case IMHO – Andersson Oct 02 '18 at 07:44
  • @Andersson: *Next, you need to take into account that requests is not a browser. requests is only a HTTP client, a browser does much, much more. A browser parses HTML for additional resources such as images, fonts, styling and scripts, loads those additional resources too, and executes scripts. Scripts can then alter what the browser displays and load additional resources. If your requests results don't match what you see in the browser, but the initial request the browser makes matches, ...* etc. – Martijn Pieters Oct 02 '18 at 10:34
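
The suggestion in the comments above (request the JSON URL rather than the HTML page) could be sketched roughly as follows. This is only a minimal sketch under stated assumptions: the layout of the JSON response is not shown anywhere in this thread, so `find_values` is a hypothetical helper that searches nested data for a key instead of relying on a fixed path, and the `User-Agent` and `Accept` headers are an assumption about what the server may expect, not something confirmed here.

```python
import requests

# JSON URL quoted in the question; the shape of its response is not known here.
JSON_URL = "https://www.sears.com/content/pdp/config/products/v1/products/04403935070P?site=sears"


def fetch_json(url, timeout=10):
    """Request the JSON endpoint directly instead of the HTML page."""
    # Assumption: browser-like headers may be needed; the thread does not confirm this.
    headers = {"User-Agent": "Mozilla/5.0", "Accept": "application/json"}
    response = requests.get(url, headers=headers, timeout=timeout)
    response.raise_for_status()  # raise on 4xx/5xx instead of parsing an error page
    return response.json()       # decode the JSON body into dicts and lists


def find_values(node, key):
    """Yield every value stored under `key` anywhere in nested JSON data.

    Hypothetical helper: it avoids guessing the exact path to a field.
    """
    if isinstance(node, dict):
        for k, v in node.items():
            if k == key:
                yield v
            yield from find_values(v, key)
    elif isinstance(node, list):
        for item in node:
            yield from find_values(item, key)
```

A call such as `data = fetch_json(JSON_URL)` followed by `list(find_values(data, "price"))` would then list whatever the response stores under a `price` key, if any; the key names to look for are guesses until the actual response has been inspected.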

0 Answers