
I'm using Python 3.4.2 and have been trying to dynamically load a custom module whose name is passed as a command-line argument. The goal is to load site-specific code for scraping particular HTML files. Example: scrape.py -m name_of_module_to_load file_to_scrape.html

I have tried a number of solutions, including this one: importing a module when the module name is in a variable

The module loads fine when I use the actual module name instead of the variable name args.module.

Code:

$ cat scrape.py 
#!/usr/bin/env python3
from urllib.request import urlopen
from urllib.error import HTTPError
from bs4 import BeautifulSoup
import argparse
import os, sys
import importlib

parser = argparse.ArgumentParser(description='HTML web scraper')
parser.add_argument('filename', help='File to act on')
parser.add_argument('-m', '--module', metavar='MODULE_NAME', help='File with code specific to the site--must be a defined class named Scrape')
args = parser.parse_args()

if args.module:
#    from get_div_content import Scrape #THIS WORKS#
    sys.path.append(os.getcwd())
    #EDIT--change this:
    #wrong# module_name = importlib.import_module(args.module, package='Scrape')
    #to this:
    module = importlib.import_module(args.module) # correct

try:
    # First treat the argument as a local file
    html = open(args.filename, 'r')
except:
    # Otherwise fall back to treating it as a URL
    try:
        html = urlopen(args.filename)
    except HTTPError as e:
        print(e)
try:
    soup = BeautifulSoup(html.read())
except:
    print("Error... Sorry... not sure what happened")

#EDIT--change this
#wrong#scraper = Scrape(soup)
#to this:
scraper = module.Scrape(soup) # correct

Module:

$ cat get_div_content.py 
class Scrape:
    def __init__(self, soup):
        content = soup.find('div', {'id':'content'})
        print(content)

Command run and Error:

$ ./scrape.py -m get_div_content.py file.html 
Traceback (most recent call last):
  File "./scrape.py", line 16, in <module>
    module_name = importlib.import_module(args.module, package='Scrape')
  File "/usr/lib/python3.4/importlib/__init__.py", line 109, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
  File "<frozen importlib._bootstrap>", line 2249, in _gcd_import
  File "<frozen importlib._bootstrap>", line 2199, in _sanity_check
SystemError: Parent module 'Scrape' not loaded, cannot perform relative import

Working Command -- No Errors:

$ ./scrape.py -m get_div_content file.html
<div id="content">
...
</div>
ajnabi

1 Answer

You don't need the package argument; it is only used to resolve relative imports. Pass only the module name:

module = importlib.import_module(args.module)

Then you have a module object whose namespace contains everything that was defined in the module:

scraper = module.Scrape(soup)
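
If the module named on the command line might not define Scrape, the lookup can be guarded so the failure is easier to read. This is only a minimal sketch: get_scraper_class is a hypothetical helper name, not part of the original script.

```python
import importlib


def get_scraper_class(module_name, class_name="Scrape"):
    """Import module_name and return the attribute called class_name.

    Hypothetical helper for illustration; not part of the original script.
    """
    module = importlib.import_module(module_name)
    try:
        return getattr(module, class_name)
    except AttributeError:
        # Turn a missing class into a clearer import-time error.
        raise ImportError(
            "module {!r} does not define {!r}".format(module_name, class_name)
        )
```

In the script this could be used as Scrape = get_scraper_class(args.module) followed by scraper = Scrape(soup).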

Remember, when calling, to use the module name, not the filename:

./scrape.py -m get_div_content file.html 
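
If you would like the script to tolerate a filename as well, one option is to strip the .py suffix before importing. The following is a rough sketch under that assumption; load_module is a hypothetical helper name, not something importlib provides.

```python
import importlib
import os
import sys


def load_module(name_or_path):
    """Import a module given its name or a path ending in '.py'.

    Hypothetical helper for illustration; not part of the original script.
    """
    directory, filename = os.path.split(name_or_path)
    module_name, ext = os.path.splitext(filename)
    if ext != ".py":
        # Not a filename: treat the whole argument as a (possibly dotted) module name.
        module_name = name_or_path
        directory = ""
    # Make the file's directory (or the current directory) importable.
    search_dir = os.path.abspath(directory or os.getcwd())
    if search_dir not in sys.path:
        sys.path.append(search_dir)
    return importlib.import_module(module_name)
```

With something like this in place, both -m get_div_content and -m get_div_content.py would resolve to the same module.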
nosklo