
I am trying to write a Python (version 2.7.5) CGI script on a CentOS 7 server. My script attempts to download data from LibriVox pages such as https://librivox.org/selections-from-battle-pieces-and-aspects-of-the-war-by-herman-melville/ and it bombs out with this error:

<class 'urllib2.URLError'>: <urlopen error [Errno 13] Permission denied> 
      args = (error(13, 'Permission denied'),) 
      errno = None 
      filename = None 
      message = '' 
      reason = error(13, 'Permission denied') 
      strerror = None

I have shut down iptables, and I can run things like `wget -O- https://librivox.org/selections-from-battle-pieces-and-aspects-of-the-war-by-herman-melville/` from the command line without error. Here is the bit of code where the error occurs:

def output_html ( url, appname, doobb ):
        print "url is %s<br>" % url
        soup = BeautifulSoup(urllib2.urlopen( url ).read())

Update: Thanks Paul and alecxe. I have updated my code like so:

def output_html ( url, appname, doobb ):
        #hdr = {'User-Agent':'Mozilla/5.0'}
        #print "url is %s<br>" % url
        #req = urllib2.Request(url, headers=hdr)
        # soup = BeautifulSoup(urllib2.urlopen( url ).read())
        headers = {'User-Agent':'Mozilla/5.0'}
        # headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.99 Safari/537.36'}
        response = requests.get( url, headers=headers)

        soup = BeautifulSoup(response.content)

... and I get a slightly different error when ...

response = requests.get( url, headers=headers)

... gets called ...

<class 'requests.exceptions.ConnectionError'>: ('Connection aborted.', error(13, 'Permission denied')) 
      args = (ProtocolError('Connection aborted.', error(13, 'Permission denied')),) 
      errno = None 
      filename = None 
      message = ProtocolError('Connection aborted.', error(13, 'Permission denied')) 
      request = <PreparedRequest [GET]> 
      response = None 
      strerror = None

... the funny thing is that I wrote a command-line version of this script and it works fine; it looks something like this ...

def output_html ( url ):
        soup = BeautifulSoup(urllib2.urlopen( url ).read())

Very strange don't you think?

Update: This question may already have an answer here: urllib2.HTTPError: HTTP Error 403: Forbidden

NO THEY DO NOT ANSWER THE QUESTION

– Red Cricket
    Did you try adding other headers to the request? like http://stackoverflow.com/questions/13303449/urllib2-httperror-http-error-403-forbidden – Paul Rooney Jan 22 '15 at 04:39

3 Answers


Finally figured it out ...

# grep python /var/log/audit/audit.log | audit2allow -M mypol
# semodule -i mypol.pp
– Red Cricket

    This helped me SO much by putting me on the right track. Thanks! SELinux on CentOS 7 was blocking a Python call to urllib/urllib2/requests from a .py file, but not from the Python command line, and the error messages were not helpful. It was driving me crazy. – John Marion Aug 04 '15 at 21:41
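
A hedged note for anyone retracing this: before generating a custom module, it is worth confirming that SELinux is actually what is denying the connection. The commands below assume the CentOS 7 defaults (audit log at /var/log/audit/audit.log and the audit/policycoreutils-python tools installed); adjust if your setup differs.

# getenforce
# grep denied /var/log/audit/audit.log | grep python
# ausearch -m avc -ts recent
# grep python /var/log/audit/audit.log | audit2why

getenforce should report Enforcing, the grep/ausearch lines should show AVC denials for the CGI script, and audit2why explains each denial, so you can see what the generated module would allow before running semodule.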

Using requests and providing a User-Agent header works for me:

from bs4 import BeautifulSoup
import requests

headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.99 Safari/537.36'}
response = requests.get("https://librivox.org/selections-from-battle-pieces-and-aspects-of-the-war-by-herman-melville/", headers=headers)

soup = BeautifulSoup(response.content)
print soup.title.text  # prints "LibriVox"
– alecxe

We were having this same issue on one of our machines. Instead of creating an SELinux module (as in the answer above), we changed an SELinux boolean to prevent similar errors from happening:

# setsebool httpd_can_network_connect on

As explained on the CentOS wiki:

httpd_can_network_connect (HTTPD Service): Allow HTTPD scripts and modules to connect to the network.

– Gordster
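
A hedged follow-up on that boolean, in case it helps the next reader: setsebool without -P only changes the running policy, so the setting is lost on reboot, while getsebool shows the current value. The boolean name is the one quoted above; nothing else is assumed beyond a stock CentOS 7 install.

# getsebool httpd_can_network_connect
# setsebool -P httpd_can_network_connect on

The -P flag writes the change into the persistent policy so it survives a reboot.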