
I am NOT new to general programming, but I am new to larger programs that require creating/importing my own modules. I've done it in C before, but that was years ago, and this is Python.

I'm looking for guidance on organizing. I finally figured out HOW to import a .py file into my project (and what it should roughly look like), as well as adding paths to Windows environment variables, but now I'm curious whether I'm doing things 'correctly' and what the best practice is. Below I've included a list of links I've already read that didn't answer my questions, but I figured I'd try to make this thread a one-stop shop, since this seems to have been a bit of a hot topic over the years.
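For context, here's a minimal sketch of the layout I believe I'm aiming for (the directory names match my project, but the layout itself is just my assumption about best practice). It builds that layout in a temp directory and imports from it, just to show that no Windows PATH changes are needed as long as the package directory is on `sys.path`:

```python
# Assumed layout (hypothetical project root):
#
#   my_project/
#       stevens_tools/
#           __init__.py      # marks the directory as a package
#           scraper_tools.py
#       test_tag_counter.py
import os
import sys
import tempfile
import textwrap

root = tempfile.mkdtemp()
pkg = os.path.join(root, 'stevens_tools')
os.makedirs(pkg)

# An empty __init__.py is enough to make 'stevens_tools' importable as a package
open(os.path.join(pkg, '__init__.py'), 'w').close()

# A stand-in scraper_tools.py with just one of the module's names
with open(os.path.join(pkg, 'scraper_tools.py'), 'w') as f:
    f.write(textwrap.dedent("""\
        reg_header = {'user-agent': 'Mozilla/5.0'}
    """))

# Putting the project root on sys.path makes the package import work
sys.path.insert(0, root)
from stevens_tools import scraper_tools

print(scraper_tools.reg_header['user-agent'])  # Mozilla/5.0
```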

I'm trying to make a kind of all-in-one module full of functions for scraping, so I can do like I did in the test file and write ONE line to do what I need, i.e. pass in a URL and get back a sorted list of all HTML tags and their frequency in the page. (This is just something to experiment with while learning about organization and external files.) It's a pain, because if something goes wrong, I have to change all kinds of files.

I'm getting errors like:

request = scraper_tools.get_request(url, data=None, headers=scraper.reg_header)
NameError: name 'scraper' is not defined
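From what I can tell, that NameError just means the name `scraper` was never bound anywhere; if I want the shorter name, I'd have to bind it with an import alias. A quick sketch with a stdlib module (the same pattern would apply to `import scraper_tools as scraper`):

```python
# NameError: name 'x' is not defined means 'x' was never bound in this scope.
# An import alias binds the short name at import time:
import json as j   # binds the name 'j' to the json module

data = j.loads('{"tags": 3}')
print(data['tags'])  # 3
```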

Am I doing it wrong, and is there a better way? (I assume there is) :)

My code goes like this:

scraper_tools.py

#!/usr/bin/env python
# Filename: scraper_tools.py

import os
import requests
import bs4 as bs

phone_header = {'user-agent': 'Mozilla/5.0 (iPhone; CPU iPhone OS 9_2 like mac OS X)'}
reg_header = {'user-agent': 'Mozilla/5.0 (Windows NT 6.1; rv:52.0) Gecko/20100101 Firefox/52.0'}

def make_soup(request, parser):
    # Make soup
    return bs.BeautifulSoup(request.text, parser)

def get_request(url, data=None, headers=None, **kwargs):
    # The old version only issued a request when extra **kwargs were passed,
    # so a plain call fell through both branches and returned None
    try:
        return requests.get(url, data=data, headers=headers, **kwargs)
    except Exception as e:
        print(e)


def get_all_items(soup, tag):
    return soup.find_all(tag)

def open_file_write(path, filename):
    # Join the directory and filename portably before opening for writing
    return open(os.path.join(path, filename), 'w')

def get_all_links(soup):
    # 'self' removed: this is a plain module-level function, not a method
    href_tags = soup.find_all(href=True)
    link_list = []

    for tag in href_tags:
        if tag['href'].startswith('http'):
            link_list.append(tag['href'])

    return link_list
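To sanity-check `open_file_write` without touching the scraper parts, here's the `os.path.join` behavior it relies on, run against a temp directory (the filename and contents are made up):

```python
import os
import tempfile

def open_file_write(path, filename):
    # Same as the module's helper: join directory + filename portably
    return open(os.path.join(path, filename), 'w')

tmp = tempfile.mkdtemp()
with open_file_write(tmp, 'tags.txt') as f:
    f.write('div 40\n')

print(open(os.path.join(tmp, 'tags.txt')).read())  # div 40
```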

get_all_tags.py

from stevens_tools import scraper_tools
import operator
import requests

'''
Author: Steven Smith
Email: StevenSmithCIS@gmail.com
Date: 9/1/2017
Description: This file uses an online resource website to dynamically get all 
common HTML tags in a list to be used to count list elements inside a specific
web page (and therefore know something about the quantity of each particular tag).
'''

html_tag_website_url = 'https://www.quackit.com/html/tags'
def get_html_tags():
    # Fetch the reference list of HTML tags from the tag website.
    # (The old version read a module-level 'soup' global that was never
    # assigned; fetching here keeps the function self-contained.)
    request = scraper_tools.get_request(html_tag_website_url, data=None,
                                        headers=scraper_tools.reg_header)
    tag_soup = scraper_tools.make_soup(request, 'lxml')
    all_tags = []
    for ul in tag_soup.find_all('ul', {'class': 'col-3 taglist'}):
        for item in ul.find_all('a'):
            all_tags.append(item.text)
    return all_tags

def get_all_tags_from(url):
    # Returns a dictionary of {tag: quantity} for the page at the passed-in
    # URL, using the reference tag list from the HTML tag website
    request = scraper_tools.get_request(url, data=None, headers=scraper_tools.reg_header)
    soup = scraper_tools.make_soup(request, 'lxml')
    tag_qnty_dict = {}
    for tag in get_html_tags():
        # Only record tags that appear at least once
        item_qnty = len(scraper_tools.get_all_items(soup, tag))
        if item_qnty > 0:
            tag_qnty_dict[tag] = item_qnty
    return tag_qnty_dict

def sort_items(tag_qnty_dict, reverse):
    # Sorts items in the tag dictionary by quantity, largest first
    # if reverse is True
    return sorted(tag_qnty_dict.items(), key=operator.itemgetter(1), reverse=reverse)

def print_all(tag_qnty_dict):
    for tag, qnty in sort_items(tag_qnty_dict, True):
        print('Tag = ' + tag + ' Quantity = ' + str(qnty))
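As a quick sanity check of the sorting approach in `sort_items`, here's the same pattern on a throwaway dict (the tag counts are made up):

```python
import operator

# Hypothetical tag counts, just to show the resulting order
tag_qnty_dict = {'div': 40, 'a': 120, 'p': 7}

# Sort (tag, quantity) pairs by quantity, largest first
ranked = sorted(tag_qnty_dict.items(), key=operator.itemgetter(1), reverse=True)
print(ranked)  # [('a', 120), ('div', 40), ('p', 7)]
```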

test_tag_counter.py

from stevens_tools import get_all_tags

get_all_tags.print_all(get_all_tags.get_all_tags_from('https://www.goodreads.com/list/tag/best'))

^^^^^^^^^^^^^^^^^^ Not too crazy about those names, but.. they're descriptive! lol

**Other topics I've visited**

- Python Packages and Modules (new to importing modules/packages in Python)
- http://mikegrouchy.com/blog/2012/05/be-pythonic-init__py.html (using __init__.py as the module/package identifier)
- create Python package and import modules (import each file vs. one time)
- Why installing package and module not same in Python? (import version problem, Python 3.4 vs 2.x)
- What's the difference between a Python module and a Python package? (<-- see name lol)
- What's the difference between "package" and "module" (<-- see name)
- Remove package and module name from sphinx function (removing module name)
- importing package and modules from another directory in python (<-- using sys)
- Best practices when importing in IPython
- http://docs.python-guide.org/en/latest/writing/structure/#modules

Steven S
  • Well, you have a module named `scraper_tools`, not `scraper`. – chepner Sep 01 '17 at 16:35
  • Oh yes. I changed the names on some things to make them more descriptive (hopefully). It's mostly just my first real experiment with using external files in a program, and I'm wondering if I got it 'basically' correct, or if there is a standard way of doing things. As you probably see, I have a small module for the basic setting up of handling requests and making BeautifulSoup, and then another one for what happens AFTER I get all the 'whatever' from the site(s). Well, I'm gonna keep working on it. Thank you! – Steven S Sep 01 '17 at 23:10

0 Answers