EDIT2: SOLVED! See the answer below regarding proper importing: use `from lib.bs4 import BeautifulSoup` instead of just `from bs4 import BeautifulSoup`.

EDIT: Putting bs4 in the root of the project seems to resolve the issue; however, it isn't an ideal structure. So, I am leaving this question active to try and get to a more robust solution.

A variation of this question has been asked in the past, but the solutions there do not seem to work. I'm unsure whether that is because of a change in BeautifulSoup or in App Engine, to be honest.

See: Python 2.7 : How to use BeautifulSoup in Google App Engine?, How to include third party Python libraries in Google App Engine?, and Which version of BeautifulSoup works with GAE (python 2.5)?

The solution proposed by Lipis seems to be adding the 3rd party library to a libs folder in the root of the project then adding the following to the main application:

import sys
sys.path.insert(0, 'libs')

Currently, my structure is this:

ntj-test
├── lib
│   └── bs4 
├── templates
├── main.py
├── get_data.py 
└── app.yaml

Here is my app.yaml:

application: ntj-test
version: 1
runtime: python27
api_version: 1
threadsafe: yes

handlers:
- url: /favicon\.ico
  static_files: favicon.ico
  upload: favicon\.ico

- url: .*
  script: main.app

libraries:
- name: webapp2
  version: latest
- name: jinja2
  version: latest

Here is my main.py:

import webapp2
import jinja2
import get_data
import sys

sys.path.insert(0, 'lib')

JINJA_ENVIRONMENT = jinja2.Environment(
    loader=jinja2.FileSystemLoader('templates'),
    extensions=['jinja2.ext.autoescape'],
    autoescape=True,
)


class MainHandler(webapp2.RequestHandler):
    def get(self):

        teamName = get_data.all_coach_data()[1]
        coachName = get_data.all_coach_data()[2]
        teamKey = get_data.all_coach_data()[0]

        values = {
            'coachName': coachName,
            'teamName': teamName,
            'teamKey': teamKey,
        }

        template = JINJA_ENVIRONMENT.get_template('index.html')
        self.response.write(template.render(values))

app = webapp2.WSGIApplication([
    ('/', MainHandler)
], debug=True)

get_data.py returns the correct data to my variables for populating values, which I have verified in the debugger.

The problem comes when launching main.py in my development environment (I haven't uploaded to gcloud yet). Without fail, regardless of the nifty tricks I've discovered through the above links or throughout my Google searching, the terminal always returns:

ImportError: No module named bs4

In one of the SO links from above, a commenter says "GAE support only Pure Python Modules. bs4 is not pure because some parts were written in C." I am not sure if this is true or not, and I'm unsure how to verify it. I don't have enough reputation to comment to find out. :(
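
One rough way to check the "pure Python" claim locally is to walk the package directory and look for compiled extension files. The sketch below is only a heuristic under that assumption; `find_compiled_files` and the suffix list are hypothetical names chosen for illustration, not part of bs4 or App Engine.

```python
import os

# Suffixes that usually indicate compiled (non-pure-Python) code.
# This list is an assumption for illustration, not an official definition.
COMPILED_SUFFIXES = ('.so', '.pyd', '.dll', '.dylib', '.c')


def find_compiled_files(package_dir):
    """Return sorted relative paths of files under package_dir that look compiled."""
    hits = []
    for dirpath, _dirnames, filenames in os.walk(package_dir):
        for filename in filenames:
            if filename.lower().endswith(COMPILED_SUFFIXES):
                full_path = os.path.join(dirpath, filename)
                hits.append(os.path.relpath(full_path, package_dir))
    return sorted(hits)
```

Running `find_compiled_files('lib/bs4')` from the project root and getting back an empty list would suggest that the copy in `lib` contains no compiled parts.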

I have been through the bs4 docs on Crummy's website, I have read all of the related SO questions and answers, and I have tried to glean hints from App Engine's documentation. However, I have been unable to find a solution that doesn't involve using the deprecated version of BeautifulSoup, which doesn't have the functionality I need.

I'm a beginner to programming and using StackOverflow, so if I have left out some important piece of information or not followed good practices with the question, please let me know. I will edit and add additional information where necessary.

Thank you!

EDITS: I wasn't sure if the get_data code would be overkill, but here it is:

from bs4 import BeautifulSoup
import urllib2, re

teamKeys = {
    'ATL': 'Atlanta Falcons',
    'HOU': 'Houston Texans',
}

def get_all_coaches():
    for key in teamKeys:
        page = urllib2.urlopen("http://www.nfl.com/teams/coaches?coaType=head&team=" + key)
        soup = BeautifulSoup(page)
        return(head_coach(soup))

def head_coach(soup):
    head = soup.select('.coachprofiletext p')[0].text
    position, name = re.split(': ', head)
    return name

def export_coach_data():
    testList = []
    for key in teamKeys:
        page = urllib2.urlopen("http://www.nfl.com/teams/coaches?coaType=head&team=" + key)
        soup = BeautifulSoup(page)
        teamKey = key
        teamName = teamKeys[key]
        headCoach = head_coach(soup)

        t = [
            teamKey,
            teamName,
            str(headCoach),
        ]

        testList.append(t)

    return(testList)

def all_coach_data():
    # Call the module-level function directly; `data` is not defined in this file.
    results = export_coach_data()

    ATL = results[0]
    HOU = results[1]

    return ATL

I'd like to point out that this is probably littered with poor execution (I've only been developing in earnest for a couple months in my spare time), but it does return the correct values to my variables in main.
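
A side note on the code above: `MainHandler.get` calls `get_data.all_coach_data()` three times, so each request fetches and parses every team page three times. Below is a minimal sketch of calling it once and unpacking the row instead. `build_template_values` is a hypothetical helper name, and the sample row is a placeholder rather than real scraped output.

```python
def build_template_values(coach_row):
    """Map one [teamKey, teamName, headCoach] row to the template dict."""
    team_key, team_name, coach_name = coach_row
    return {
        'coachName': coach_name,
        'teamName': team_name,
        'teamKey': team_key,
    }


# Hypothetical sample row in the same order export_coach_data() builds it;
# the coach name is a placeholder, not real scraped data.
sample_row = ['ATL', 'Atlanta Falcons', 'Example Coach']
values = build_template_values(sample_row)
```

In the handler this would become `values = build_template_values(get_data.all_coach_data())`, which triggers the download only once per request.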

Here is the Appengine Launcher log:

2014-11-05 15:36:53 Running command: "['C:\\Python27\\pythonw.exe', 'C:\\Program Files\\Google\\Cloud SDK\\google-cloud-sdk\\platform\\google_appengine\\dev_appserver.py', '--skip_sdk_update_check=yes', '--port=11080', '--admin_port=8003', u'G:\\projects\\coaches']"
INFO     2014-11-05 15:37:00,119 devappserver2.py:725] Skipping SDK update check.
WARNING  2014-11-05 15:37:00,157 api_server.py:383] Could not initialize images API; you are likely missing the Python "PIL" module.
INFO     2014-11-05 15:37:00,190 api_server.py:171] Starting API server at: http://localhost:19713
INFO     2014-11-05 15:37:00,210 dispatcher.py:183] Starting module "default" running at: http://localhost:11080
INFO     2014-11-05 15:37:00,216 admin_server.py:117] Starting admin server at: http://localhost:8003
ERROR    2014-11-05 20:37:48,726 wsgi.py:262]
Traceback (most recent call last):
  File "C:\Program Files\Google\Cloud SDK\google-cloud-sdk\platform\google_appengine\google\appengine\runtime\wsgi.py", line 239, in Handle
    handler = _config_handle.add_wsgi_middleware(self._LoadHandler())
  File "C:\Program Files\Google\Cloud SDK\google-cloud-sdk\platform\google_appengine\google\appengine\runtime\wsgi.py", line 298, in _LoadHandler
    handler, path, err = LoadObject(self._handler)
  File "C:\Program Files\Google\Cloud SDK\google-cloud-sdk\platform\google_appengine\google\appengine\runtime\wsgi.py", line 84, in LoadObject
    obj = __import__(path[0])
  File "G:\projects\coaches\main.py", line 3, in <module>
    import get_data
  File "G:\projects\coaches\get_data.py", line 1, in <module>
    from bs4 import BeautifulSoup
ImportError: No module named bs4

INFO     2014-11-05 15:37:48,762 module.py:652] default: "GET / HTTP/1.1" 500 -
nicholas
  • how are you actually doing the import? "from bs4 import BeautifulSoup"? – Paul Collingwood Nov 05 '14 at 20:07
  • You could just parse the HTML manually or extract the content you're looking for by finding patterns. – Ryan Nov 05 '14 at 20:08
  • Where is the code where you are doing that import? – Daniel Roseman Nov 05 '14 at 20:10
  • and how are you "launching" main.py. You don't run it like "python main.py" do you? – Paul Collingwood Nov 05 '14 at 20:12
  • I'll update my post to answer these questions. – nicholas Nov 05 '14 at 20:19
  • Did you try just putting the `bs4` dir in the main app dir? That's what I do when I use 3rd party libraries and it's worked fine (assuming it's pure python) – Ryan Nov 05 '14 at 20:28
  • @Ryan - I'm unfamiliar with what you describe. Could you give me a little more detail? Edit: this comment was referring to your first comment about parsing and extracting the HTML manually. Also, I did attempt to put the bs4 library in the root, unsuccessfully. – nicholas Nov 05 '14 at 20:39
  • @PaulCollingwood - I'm developing in Pycharm which has a "Launch Appengine application" option. I've also launched directly from the Google App Engine Launcher. I've put the log in the main post. Thank you! – nicholas Nov 05 '14 at 20:40
  • @Ryan CORRECTION! I AM able to successfully launch the application with bs4 in the root. However, is that the best implementation? I would prefer to implement some variation of Lipis's suggestion of putting it in a libs folder. – nicholas Nov 05 '14 at 20:51
  • To begin finding a way to parse manually, just print the output of `page` with `print page.read()` after you open the URL. That will display the HTML of the page, which you can then go through manually to find predictable structures around the content you're looking for. But since you're able to use BeautifulSoup maybe you don't need to do that. – Ryan Nov 05 '14 at 21:02
  • I don't think there's any advantage to having a separate folder for 3rd party libraries versus having them in the main dir - aside from it being a little bit more organized. – Ryan Nov 05 '14 at 21:04
  • A Reddit user helped me figure out the answer. I've posted it as the solution to the OP. Thank you all for your help! – nicholas Nov 07 '14 at 22:00

4 Answers

EDIT: It has been pointed out that this is a bit of a hack. If so, how can this solution be modified to not require renaming of modules inside BS4?

A couple users over at http://www.reddit.com/r/learnpython helped me solve this problem.

By expanding on the solution proposed by Lipis, we added the following to main.py:

import os, sys

rootdir = os.path.dirname(os.path.abspath(__file__))
lib = os.path.join(rootdir, 'lib')
sys.path.append(lib)

Then, and here's what no one ever mentioned here or in any of the other SO answers, I added "lib.bs4" to all of my import statements, as such:

from lib.bs4 import BeautifulSoup

But, not only that, there were references to bs4 within the bs4 library itself, so I searched for and replaced all of those with lib.bs4.<something>.

Now, finally, my app runs, and the structure is organized. All the credit goes to /u/invalidusemame and /u/prohulaelk.

Hopefully, this post helps someone else stuck in a similar situation. Maybe it should have been obvious that all the imports would need to have the `lib.` prefix added to the import statement, but it wasn't immediately obvious from all of the answers.

Thank you to everyone who helped troubleshoot!
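
Regarding the EDIT at the top of this answer: the traceback in the question fails at `import get_data` on line 3 of main.py, which runs before the `sys.path.insert(0, 'lib')` call further down, so `lib` may simply not have been on the path yet when `from bs4 import BeautifulSoup` executed. The sketch below reproduces that ordering effect with a throwaway package. `fakepkg_demo` and the temporary directory are stand-ins invented for the demo, not bs4 itself.

```python
import importlib
import os
import shutil
import sys
import tempfile

# Build a throwaway project: <tmp>/lib/fakepkg_demo/__init__.py
project_dir = tempfile.mkdtemp()
lib_dir = os.path.join(project_dir, 'lib')
os.makedirs(os.path.join(lib_dir, 'fakepkg_demo'))
with open(os.path.join(lib_dir, 'fakepkg_demo', '__init__.py'), 'w') as f:
    f.write("NAME = 'fakepkg_demo'\n")

# 1) Importing before lib/ is on sys.path fails, like the traceback above.
try:
    importlib.import_module('fakepkg_demo')
    imported_too_early = True
except ImportError:
    imported_too_early = False

# 2) Put lib/ on sys.path first, then import: the plain name now resolves.
sys.path.insert(0, lib_dir)
fakepkg = importlib.import_module('fakepkg_demo')

# Clean up so the demo leaves nothing behind.
sys.path.remove(lib_dir)
shutil.rmtree(project_dir)
```

If the same ordering issue is what happened here, moving the `sys.path` lines above `import get_data` in main.py should let `from bs4 import BeautifulSoup` work without editing bs4 itself.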

nicholas
  • If you properly add `lib` to your sys path, you wouldn't have to add `lib.bs4` everywhere. Improper solution. When you update bs4, all your paths will be back to where you were. -1 – GAEfan Nov 08 '14 at 01:54
  • I still added lib to sys.path. What wasn't done properly? I understand your concern, but I'm not sure what the solution is. – nicholas Nov 08 '14 at 13:15

I believe your issue is a typo in main.py:

sys.path.insert(0, 'lib')

Your directory is libs, not lib.

GAEfan
  • That directory structure had a typo. The folder is named lib in my project, but I copied that structure from another question. I have edited the original post. – nicholas Nov 06 '14 at 14:17
  • This isn't the answer. It was just a typo. Can you delete it? – nicholas Nov 08 '14 at 17:20

Alternatively, you could create a file called appengine_config.py for loading third-party libs. This file will load when starting a new instance.

import sys
import os.path
# add `lib` subdirectory to `sys.path`, so our `main` module can load third-party libraries.
sys.path.insert(0, os.path.join(os.path.dirname(__file__), 'lib'))
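
A small variation on the snippet above, shown as a sketch: wrapping the path logic in a function makes it safe to call more than once and easy to check outside App Engine. `add_lib_dir` is a hypothetical helper name, not an App Engine API.

```python
import os
import sys


def add_lib_dir(base_file, folder='lib'):
    """Put <directory of base_file>/<folder> at the front of sys.path, once."""
    lib_dir = os.path.join(os.path.dirname(os.path.abspath(base_file)), folder)
    if lib_dir not in sys.path:
        sys.path.insert(0, lib_dir)
    return lib_dir
```

In appengine_config.py this would be called as `add_lib_dir(__file__)`. Because the path is built from `__file__` rather than the bare string `'lib'`, it does not depend on the current working directory of the process.
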
tzmartin

OK, a couple of other fixes. Import lxml in app.yaml, under libraries:

libraries:
- name: lxml
  version: "2.3" <<- do NOT use "latest"

Make sure you have an __init__.py file in lib. I added some code there to make it self-attach:

import os
import sys

libs_directory = os.path.dirname(os.path.abspath(__file__))
if libs_directory not in sys.path:
    sys.path.insert(0, libs_directory)

root_directory = os.path.dirname(os.path.dirname(os.path.abspath(__file__)))
if root_directory not in sys.path:
    sys.path.insert(0, root_directory)
GAEfan
  • Do I need to make any modifications to the __init__.py file? If not, I have made these changes, and I get the same error. Should I remove the sys.path.insert from the main.py file? Thanks for your help! – nicholas Nov 06 '14 at 15:50
  • The code in `__init__.py` is not necessary. Your method of `sys.path.insert(0, 'lib')` should work. What version of bs4 did you install? (Open its `__init__.py` to check.) Did you add lxml to your `-libraries` in app.yaml? – GAEfan Nov 06 '14 at 15:58
  • I did add lxml to my app.yaml libraries. I removed the <<-- do not use part, of course. I'm using __version__ = "4.3.2" of bs4. I've tried the lib/__init__.py file with and without the code you posted above without success. Edit: I've updated the GH repo with my current code. – nicholas Nov 06 '14 at 16:09
  • After `sys.path.insert(0, 'lib')` add this: `import logging; logging.info(sys.path)` to check which directories are on the import path – GAEfan Nov 06 '14 at 16:15
  • Okay, I've done that and run the application with the same error. Where will I find the log? According to "print sys.path" it should be my root directory, but there wasn't any log there. – nicholas Nov 06 '14 at 16:29
  • A Reddit user helped me figure out the answer. I've posted it as the solution to the OP. Thank you all for your help! – nicholas Nov 07 '14 at 21:59