I am trying to get the Market Cost from this website, but the request fails before I can read the price. I read in other topics that this can happen when using urllib, because mod_security blocks its user agent. Is that the case here?

What can I do to return the market cost from the page?

import urllib.request
from urllib.request import urlopen
import re


htmlfile = urlopen("http://xiv-market.com/item_details.php?id=2727")

htmltext = htmlfile.read()

regex = b'<h2 class="details">Market Cost: <img src="images/gil.png" width="24px" height="23px" style="margin-bottom:-5px;" alt="Gil">(.+?)</h2>'

pattern = re.compile(regex)

price = re.findall(pattern, htmltext) 

print(price)

Here is the error:

Traceback (most recent call last):
  File "C:/Python34/Gw2.py", line 6, in <module>
    htmlfile = urlopen("http://xiv-market.com/item_details.php?id=2727")
  File "C:\Python34\lib\urllib\request.py", line 153, in urlopen
    return opener.open(url, data, timeout)
  File "C:\Python34\lib\urllib\request.py", line 461, in open
    response = meth(req, response)
  File "C:\Python34\lib\urllib\request.py", line 574, in http_response
    'http', request, response, code, msg, hdrs)
  File "C:\Python34\lib\urllib\request.py", line 499, in error
    return self._call_chain(*args)
  File "C:\Python34\lib\urllib\request.py", line 433, in _call_chain
    result = func(*args)
  File "C:\Python34\lib\urllib\request.py", line 582, in http_error_default
    raise HTTPError(req.full_url, code, msg, hdrs, fp)
urllib.error.HTTPError: HTTP Error 403: Forbidden
Squexis

2 Answers

This looks similar to the case in this thread: HTTP error 403 in Python 3 Web Scraping

Stefano states that "This is probably because of mod_security or some similar server security feature which blocks known spider/bot user agents (urllib uses something like python urllib/3.3.0, it's easily detected). Try setting a known browser user agent with:"

Here is code for your example:

import re
from urllib.request import Request, urlopen

# Send a browser-like User-Agent instead of urllib's default one
req = Request('http://xiv-market.com/item_details.php?id=2727', headers={'User-Agent': 'Mozilla/5.0'})
htmltext = urlopen(req).read()

# read() returns bytes, so the pattern is a bytes literal as well
regex = b'<h2 class="details">Market Cost: <img src="images/gil.png" width="24px" height="23px" style="margin-bottom:-5px;" alt="Gil" />(.+?)</h2>\n'
pattern = re.compile(regex)

price = re.findall(pattern, htmltext)

print(price)

This works for me. I also changed the regex slightly so that it returns a result: the img tag now ends with " />" and the pattern expects a newline after the closing </h2>. Hope this helps.
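Since the exact-match regex had to be adjusted to get a result, a slightly more tolerant pattern may be easier to live with. The following is only a sketch: the helper name extract_price is made up for illustration, and the sample markup (including the price 1,234) is a hypothetical stand-in modelled on the regex above, not the real page content.

```python
import re

# Tolerant pattern: accept any attributes on the <img> tag and optional
# whitespace around the price, instead of matching the markup exactly.
PRICE_RE = re.compile(
    rb'<h2 class="details">Market Cost:\s*<img[^>]*>\s*(.+?)\s*</h2>',
    re.DOTALL,
)


def extract_price(html):
    """Return the price text as str, or None if the heading is not found."""
    match = PRICE_RE.search(html)
    return match.group(1).decode("utf-8") if match else None


# Hypothetical sample modelled on the regex in this answer (made-up price).
sample = (
    b'<h2 class="details">Market Cost: <img src="images/gil.png" '
    b'width="24px" height="23px" style="margin-bottom:-5px;" alt="Gil" />'
    b'1,234</h2>\n'
)

print(extract_price(sample))
```

In the code above you would pass htmltext to extract_price instead of the sample. For anything beyond a single value, a real HTML parser is usually a safer choice than a regex.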

Serc

To find a reliable workaround, you need to know exactly why you are receiving the 403 error page; many different causes can produce that error. If you want to try getting around it by supplying a user agent, build a full Request object and include the User-Agent in its headers.

Example:

import urllib.request

# The URL from the question
url = 'http://xiv-market.com/item_details.php?id=2727'

req = urllib.request.Request(
    url,
    data=None,
    headers={
        'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/35.0.1916.47 Safari/537.36'
    }
)

f = urllib.request.urlopen(req)

Reference: Python documentation for urllib.request
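To get closer to the actual reason for a 403, it can help to catch the HTTPError and look at what the server sent back. This is a minimal sketch, not a definitive diagnosis tool: the helper names describe_http_error and fetch are made up for illustration, and the demonstration at the bottom uses a hand-built error with placeholder values so that no request is sent.

```python
import io
import urllib.error
import urllib.request


def describe_http_error(err):
    """Collect the parts of an HTTPError that hint at why it was raised."""
    return {
        "status": err.code,                   # e.g. 403
        "reason": err.reason,                 # e.g. "Forbidden"
        "server": err.headers.get("Server"),  # may name the server software
        "body_start": err.read(200),          # first bytes of the error page
    }


def fetch(url, user_agent="Mozilla/5.0"):
    """Return (body, None) on success or (None, details) on an HTTP error."""
    req = urllib.request.Request(url, headers={"User-Agent": user_agent})
    try:
        with urllib.request.urlopen(req) as response:
            return response.read(), None
    except urllib.error.HTTPError as err:
        return None, describe_http_error(err)


# Offline demonstration: a hand-built error with placeholder values.
fake = urllib.error.HTTPError(
    "http://example.invalid/",
    403,
    "Forbidden",
    {"Server": "hypothetical"},
    io.BytesIO(b"<h1>Forbidden</h1>"),
)
info = describe_http_error(fake)
print(info["status"], info["reason"])
```

With the real site you would call fetch(url) and inspect the second element of the result. The error page body often says more about the refusal than the status code alone.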

David Scott