EDIT 2: SOLVED! See the answer below regarding proper importing. The fix was to use from lib.bs4 import BeautifulSoup
instead of just from bs4 import BeautifulSoup.
EDIT: Putting bs4 in the root of the project seems to resolve the issue; however, it isn't an ideal structure. So, I am leaving this question active to try and get to a more robust solution.
A variation of this question has been asked in the past, but the solutions there do not seem to work. I'm unsure if that is because of a change with BeautifulSoup or with Appengine, to be honest.
See: Python 2.7 : How to use BeautifulSoup in Google App Engine?, How to include third party Python libraries in Google App Engine?, and Which version of BeautifulSoup works with GAE (python 2.5)?
The solution proposed by Lipis is to add the third-party library to a libs folder in the root of the project and then add the following to the main application:
import sys
sys.path.insert(0, 'libs')
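As I understand it, that approach could be sketched roughly like this. The helper name add_vendor_dir and its parameters are my own (hypothetical, not from Lipis's answer); it resolves the folder to an absolute path so the result does not depend on the working directory. The call has to run before any import that relies on the vendored folder.

```python
import os
import sys


def add_vendor_dir(base_dir, folder='libs'):
    """Put <base_dir>/<folder> at the front of sys.path.

    Hypothetical helper for illustration; returns the absolute path added.
    """
    path = os.path.abspath(os.path.join(base_dir, folder))
    if path not in sys.path:
        # Only insert once, so repeated calls do not grow sys.path.
        sys.path.insert(0, path)
    return path


# Intended use at the very top of the main application module:
#
#     add_vendor_dir(os.path.dirname(os.path.abspath(__file__)))
#     # ...imports that need the vendored library go after this line...
```

The order matters because Python resolves an import at the moment the import statement executes, so a path added afterwards comes too late for it.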
Currently, my structure is this:
ntj-test
├── lib
│ └── bs4
├── templates
├── main.py
├── get_data.py
└── app.yaml
Here is my app.yaml:
application: ntj-test
version: 1
runtime: python27
api_version: 1
threadsafe: yes
handlers:
- url: /favicon\.ico
  static_files: favicon.ico
  upload: favicon\.ico
- url: .*
  script: main.app
libraries:
- name: webapp2
  version: latest
- name: jinja2
  version: latest
Here is my main.py:
import webapp2
import jinja2
import get_data
import sys
sys.path.insert(0, 'lib')

JINJA_ENVIRONMENT = jinja2.Environment(
    loader=jinja2.FileSystemLoader('templates'),
    extensions=['jinja2.ext.autoescape'],
    autoescape=True,
)

class MainHandler(webapp2.RequestHandler):
    def get(self):
        teamName = get_data.all_coach_data()[1]
        coachName = get_data.all_coach_data()[2]
        teamKey = get_data.all_coach_data()[0]
        values = {
            'coachName': coachName,
            'teamName': teamName,
            'teamKey': teamKey,
        }
        template = JINJA_ENVIRONMENT.get_template('index.html')
        self.response.write(template.render(values))

app = webapp2.WSGIApplication([
    ('/', MainHandler)
], debug=True)
get_data.py returns the correct data to my variables for populating values, which I have verified in the debugger.
The problem comes when launching main.py in my development environment (I haven't uploaded to gcloud yet). Without fail, regardless of the nifty tricks I've discovered through the above links or throughout my Google searching, the terminal always returns:
ImportError: No module named bs4
In one of the SO links from above, a commenter says "GAE support only Pure Python Modules. bs4 is not pure because some parts were written in C." I am not sure if this is true or not, and I'm unsure how to verify it. I don't have enough reputation to comment to find out. :(
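One rough way I could check this myself is to scan the package folder for compiled or C-source files. This is only a heuristic sketch: the function name find_compiled_files and the suffix list are my own assumptions, and an empty result suggests (but does not prove) that the package is pure Python.

```python
import os

# File suffixes that usually indicate compiled or C-source components.
COMPILED_SUFFIXES = ('.so', '.pyd', '.dll', '.dylib', '.c', '.pyx')


def find_compiled_files(package_dir):
    """Return sorted relative paths of non-pure-Python files under package_dir."""
    hits = []
    for root, _dirs, files in os.walk(package_dir):
        for name in files:
            if name.lower().endswith(COMPILED_SUFFIXES):
                full = os.path.join(root, name)
                hits.append(os.path.relpath(full, package_dir))
    return sorted(hits)
```

With my layout, the call would be find_compiled_files(os.path.join('lib', 'bs4')), run from the project root.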
I have been through the bs4 docs on Crummy's website, I have read all of the related SO questions and answers, and I have tried to glean hints from App Engine's documentation. However, I have been unable to find a solution that doesn't involve using the deprecated version of BeautifulSoup, which doesn't have the functionality I need.
I'm a beginner to programming and using StackOverflow, so if I have left out some important piece of information or not followed good practices with the question, please let me know. I will edit and add additional information where necessary.
Thank you!
EDITS: I wasn't sure if the get_data code would be overkill, but here it is:
from bs4 import BeautifulSoup
import urllib2, re

teamKeys = {
    'ATL': 'Atlanta Falcons',
    'HOU': 'Houston Texans',
}

def get_all_coaches():
    for key in teamKeys:
        page = urllib2.urlopen("http://www.nfl.com/teams/coaches?coaType=head&team=" + key)
        soup = BeautifulSoup(page)
        return(head_coach(soup))

def head_coach(soup):
    head = soup.select('.coachprofiletext p')[0].text
    position, name = re.split(': ', head)
    return name

def export_coach_data():
    testList = []
    for key in teamKeys:
        page = urllib2.urlopen("http://www.nfl.com/teams/coaches?coaType=head&team=" + key)
        soup = BeautifulSoup(page)
        teamKey = key
        teamName = teamKeys[key]
        headCoach = head_coach(soup)
        t = [
            teamKey,
            teamName,
            str(headCoach),
        ]
        testList.append(t)
    return(testList)

def all_coach_data():
    results = export_coach_data()
    ATL = results[0]
    HOU = results[1]
    return ATL
I'd like to point out that this is probably littered with poor execution (I've only been developing in earnest for a couple months in my spare time), but it does return the correct values to my variables in main.
Here is the Appengine Launcher log:
2014-11-05 15:36:53 Running command: "['C:\\Python27\\pythonw.exe', 'C:\\Program Files\\Google\\Cloud SDK\\google-cloud-sdk\\platform\\google_appengine\\dev_appserver.py', '--skip_sdk_update_check=yes', '--port=11080', '--admin_port=8003', u'G:\\projects\\coaches']"
INFO 2014-11-05 15:37:00,119 devappserver2.py:725] Skipping SDK update check.
WARNING 2014-11-05 15:37:00,157 api_server.py:383] Could not initialize images API; you are likely missing the Python "PIL" module.
INFO 2014-11-05 15:37:00,190 api_server.py:171] Starting API server at: http://localhost:19713
INFO 2014-11-05 15:37:00,210 dispatcher.py:183] Starting module "default" running at: http://localhost:11080
INFO 2014-11-05 15:37:00,216 admin_server.py:117] Starting admin server at: http://localhost:8003
ERROR 2014-11-05 20:37:48,726 wsgi.py:262]
Traceback (most recent call last):
  File "C:\Program Files\Google\Cloud SDK\google-cloud-sdk\platform\google_appengine\google\appengine\runtime\wsgi.py", line 239, in Handle
    handler = _config_handle.add_wsgi_middleware(self._LoadHandler())
  File "C:\Program Files\Google\Cloud SDK\google-cloud-sdk\platform\google_appengine\google\appengine\runtime\wsgi.py", line 298, in _LoadHandler
    handler, path, err = LoadObject(self._handler)
  File "C:\Program Files\Google\Cloud SDK\google-cloud-sdk\platform\google_appengine\google\appengine\runtime\wsgi.py", line 84, in LoadObject
    obj = __import__(path[0])
  File "G:\projects\coaches\main.py", line 3, in <module>
    import get_data
  File "G:\projects\coaches\get_data.py", line 1, in <module>
    from bs4 import BeautifulSoup
ImportError: No module named bs4
INFO 2014-11-05 15:37:48,762 module.py:652] default: "GET / HTTP/1.1" 500 -