29

I've written a simple web scraper for a comic website. I'm running it on Ubuntu (Linux ubuntu 4.18.0-16-generic #17~18.04.1-Ubuntu) but when I execute the script (permissions set to chmod ug+x) I keep getting a series of errors with imported system libraries along with a confusing syntax error:

import-im6.q16: not authorized `time' @ error/constitute.c/WriteImage/1037.
import-im6.q16: not authorized `os' @ error/constitute.c/WriteImage/1037.
import-im6.q16: not authorized `sys' @ error/constitute.c/WriteImage/1037.
import-im6.q16: not authorized `re' @ error/constitute.c/WriteImage/1037.
import-im6.q16: not authorized `requests' @ error/constitute.c/WriteImage/1037.
from: can't read /var/mail/bs4
./poorlywrittenscraper.py: line 15: DEFAULT_DIR_NAME: command not found
./poorlywrittenscraper.py: line 16: syntax error near unexpected token `('
./poorlywrittenscraper.py: line 16: `COMICS_DIRECTORY = os.path.join(os.getcwd(), DEFAULT_DIR_NAME)'

Interestingly enough, when I run the same script via python3 it fires up, creates the folder, fetches the images but... does not save them. o.O

Any idea what am I missing here or how to fix this?

Here's the full code of the script:

"""
A simple image downloader for poorlydrawnlines.com/archive
"""
import time
import os
import sys
import re
import concurrent.futures


import requests
from bs4 import BeautifulSoup as bs


DEFAULT_DIR_NAME = "poorly_created_folder"
COMICS_DIRECTORY = os.path.join(os.getcwd(), DEFAULT_DIR_NAME)


LOGO = """
a Python comic(al) scraper for poorlydwarnlines.com
                         __
.-----.-----.-----.----.|  |.--.--.
|  _  |  _  |  _  |   _||  ||  |  |
|   __|_____|_____|__|  |__||___  |
|__|                        |_____|
                __ __   __
.--.--.--.----.|__|  |_|  |_.-----.-----.
|  |  |  |   _||  |   _|   _|  -__|     |
|________|__|  |__|____|____|_____|__|__|
.-----.----.----.---.-.-----.-----.----.
|__ --|  __|   _|  _  |  _  |  -__|   _|
|_____|____|__| |___._|   __|_____|__|
                      |__|
version: 0.4 | author: baduker | https://github.com/baduker
"""


ARCHIVE_URL = "http://www.poorlydrawnlines.com/archive/"
COMIC_PATTERN = re.compile(r'http://www.poorlydrawnlines.com/comic/.+')

def download_comics_menu(comics_found):
    """
    Main download menu, takes number of available comics for download
    """
    print("\nThe scraper has found {} comics.".format(len(comics_found)))
    print("How many comics do you want to download?")
    print("Type 0 to exit.")

    while True:
        try:
            comics_to_download = int(input(">> "))
        except ValueError:
            print("Error: expected a number. Try again.")
            continue
        if comics_to_download > len(comics_found) or comics_to_download < 0:
            print("Error: incorrect number of comics to download. Try again.")
            continue
        elif comics_to_download == 0:
            sys.exit()
        return comics_to_download


def grab_image_src_url(session, url):
    """
    Fetches urls with the comic image source
    """
    response = session.get(url)
    soup = bs(response.text, 'html.parser')
    for i in soup.find_all('p'):
        for img in i.find_all('img', src=True):
            return img['src']


def download_and_save_comic(session, url):
    """
    Downloads and saves the comic image
    """
    file_name = url.split('/')[-1]
    with open(os.path.join(COMICS_DIRECTORY, file_name), "wb") as file:
        response = session.get(url)
        file.write(response.content)


def fetch_comics_from_archive(session):
    """
    Grabs all urls from the poorlydrawnlines.com/archive and parses for only those that link to published comics
    """
    response = session.get(ARCHIVE_URL)
    soup = bs(response.text, 'html.parser')
    comics = [url.get("href") for url in soup.find_all("a")]
    return [url for url in comics if COMIC_PATTERN.match(url)]


def download_comic(session, url):
    """
    Download progress information
    """
    print("Downloading: {}".format(url))
    url = grab_image_src_url(session, url)
    download_and_save_comic(session, url)


def main():
    """
    Encapsulates and executes all methods in the main function
    """
    print(LOGO)

    session = requests.Session()

    comics = fetch_comics_from_archive(session)
    comics_to_download = download_comics_menu(comics)

    try:
        os.mkdir(DEFAULT_DIR_NAME)
    except OSError as exc:
        sys.exit("Failed to create directory (error_no {})".format(exc.error_no))

    start = time.time()
    with concurrent.futures.ThreadPoolExecutor() as executor:
        executor.map(lambda url: download_comic(session, url), comics[:comics_to_download])
    executor.shutdown()
    end = time.time()
    print("Finished downloading {} comics in {:.2f} sec.".format(comics_to_download, end - start))

if __name__ in "__main__":
    main()
Vadim Kotov
  • 8,084
  • 8
  • 48
  • 62
baduker
  • 19,152
  • 9
  • 33
  • 56

1 Answers1

87

I am pretty sure you are missing a shebang at the beginning of your file, for example

#!/usr/bin/env python3
#!/usr/bin/env python2
Patol75
  • 4,342
  • 1
  • 17
  • 28
  • You were right with the `#!` but it solved the issue partially. I can now run the script alright but for some reason it still won't wirte the files to disk, although it downloads them. – baduker Apr 10 '19 at 16:54
  • I guess this part `with open(os.path.join(COMICS_DIRECTORY, file_name), "wb") as file: response = session.get(url) file.write(response.content)` is the one which you mention does not work ? In that case, I would try to check everything, meaning checking that the file is created and correctly opened, that the get request succeeds, etc... Also, probably an obvious one, but you have rights to write in the target folder ? – Patol75 Apr 11 '19 at 05:32
  • 1
    Yes, I do have the rights to execute the script and write to any folder. I'll check the method and get back to you. Thanks for the feedback! – baduker Apr 11 '19 at 07:39
  • 1
    I had some code generated from a code generator, and the shebang was commented out, didn't notice it until reading your answer. – Didi Kohen Dec 22 '22 at 07:47