
I want to scrape some data from a website, for example the image URL and the page title.

However, the response I get back is not the actual page.

Code:

import urllib2
from bs4 import BeautifulSoup

url_list = [
    "https://www.nfm.com/DetailsPage.aspx?productid=43382514"
]

# Image URL: https://www.nfm.com/GetPhoto.ashx?ProductID=43382514&Size=L


def get_data(url):
    user_agent = '"Mozilla/5.0 (X11; U; Linux i686) Gecko/20071127 Firefox/2.0.0.11"'
    headers = {'User-Agent': user_agent}
    page = urllib2.Request(url, None, headers)
    page2 = urllib2.urlopen(page)
    soup = BeautifulSoup(page2, 'html.parser')
    print soup.prettify('latin-1')
    # img_url = https://www.nfm.com/GetPhoto.ashx?ProductID=43382514&Size=L

for i in url_list:
    get_data(i)

Result is:

<html>
 <body>
  <script type="text/javascript">
   document.cookie="ns_cls="+"w:"+screen.width+",h:"+screen.height+",ua:"+escape(navigator.userAgent)
window.location.href = "https://www.nfm.com/DetailsPage.aspx?productid=43382514"
  </script>
 </body>
</html>

So the only HTML I get back is this stub, which sets a cookie and redirects to the same URL I am requesting from my Python script (urllib2 module).

Python's requests module returns the same response.

I don't know how to get the proper response. Please help!

shiv shankar
  • Because that is what it gets; the information you retrieve for the page is correct. Did you try blocking cookies in your web browser when you access that page? Maybe you should try another approach to get the content. – Víctor M Feb 08 '16 at 06:51
  • How @VíctorM ?? Please give me some idea – shiv shankar Feb 08 '16 at 06:52
  • I have found this: http://stackoverflow.com/questions/1418082/is-it-possible-to-hide-the-browser-in-selenium-rc. You can use it with BeautifulSoup and get the content – Víctor M Feb 08 '16 at 07:13
  • You could simply forge the cookie yourself and add it to the headers of your urllib2 request. The recipe is right there in the response. – Abaddon666 Feb 08 '16 at 07:14
  • @VíctorM Selenium is super slow!!! I need to work on urllib2 – shiv shankar Feb 08 '16 at 07:36
  • @shivshankar then set the cookies as Abaddon666 wrote – Víctor M Feb 08 '16 at 07:43
  • Use the `requests` module and set the cookie value: `cookies = {'ns_cls': 'w:800,h:600,ua:' + user_agent}`. Then call it with `requests.get(url, cookies=cookies)`. Please remove the extra single or double quotes inside the `user_agent` variable. – Víctor M Feb 08 '16 at 08:19
  • @VíctorM, it's working, bro :) Thank you so much – shiv shankar Feb 08 '16 at 08:52
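
The cookie approach suggested in the comments can be sketched with the standard library alone. This is a minimal sketch under a few assumptions: it targets Python 3, where `urllib2` became `urllib.request`; the helper names `build_ns_cls_cookie`, `build_request` and `fetch_html` are hypothetical; the 800x600 screen size is an arbitrary value taken from the comment above; and `quote()` only approximates what JavaScript's `escape()` does in the page's script. Whether the site accepts the forged cookie may change over time.

```python
from urllib.parse import quote
from urllib.request import Request, urlopen

# Same user agent as in the question, without the extra inner quotes.
USER_AGENT = "Mozilla/5.0 (X11; U; Linux i686) Gecko/20071127 Firefox/2.0.0.11"


def build_ns_cls_cookie(user_agent, width=800, height=600):
    """Build the cookie the page's script would set:
    ns_cls = "w:<width>,h:<height>,ua:<escaped user agent>".
    """
    # Assumption: quote() with this safe set is close enough to
    # JavaScript's escape() for a plain ASCII user agent string.
    escaped_ua = quote(user_agent, safe="@*_+-./")
    return "ns_cls=w:{},h:{},ua:{}".format(width, height, escaped_ua)


def build_request(url, user_agent=USER_AGENT):
    """Create a request that already carries the forged cookie."""
    headers = {
        "User-Agent": user_agent,
        "Cookie": build_ns_cls_cookie(user_agent),
    }
    return Request(url, headers=headers)


def fetch_html(url):
    """Download the page body as bytes (performs a network call)."""
    with urlopen(build_request(url)) as response:
        return response.read()
```

Calling `fetch_html("https://www.nfm.com/DetailsPage.aspx?productid=43382514")` should then return the product page rather than the redirect stub, provided the site still uses this check, and the returned bytes can be passed to `BeautifulSoup(html, 'html.parser')` as in the question.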
