I am currently creating a Django app that is supposed to run web scraping code as soon as it starts and then serve certain data on request via a REST API. The requirement is that it must run on Docker, which is causing me the following problem: when using docker-compose up the image builds properly and the db service runs, but then I get an error saying that relations in my DB do not exist. I can rectify this by running docker-compose run [service] manage.py migrate, but that is a manual step and won't work when someone clones the app from git and tries to run it via docker-compose up.
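Concretely, the one-off manual fix I run looks like this (web being the service name from the compose file below):

docker-compose run web python manage.py migrate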
I have added command: python /teonite_webscraper/manage.py migrate --noinput to my docker-compose.yml, but it does not seem to run for some reason.
docker-compose.yml:
version: '3.6'
services:
  db:
    image: postgres:10.1-alpine
    volumes:
      - postgres_data:/var/lib/postgresql/data/
  web:
    build: .
    command: python /teonite_webscraper/manage.py migrate --noinput
    command: python /teonite_webscraper/manage.py runserver 0.0.0.0:8080
    volumes:
      - .:/teonite_webscraper
    ports:
      - 8080:8080
    environment:
      - SECRET_KEY=changemeinprod
    depends_on:
      - db
volumes:
  postgres_data:
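Writing this out, I notice the web service has two command: keys. Since a key in a YAML mapping can only appear once, I assume the runserver line silently overrides the migrate line, so the migrate command may never run at all. If that is the cause, I would expect a single combined command to work, something like this (untested sketch):

web:
  build: .
  command: >
    sh -c "python /teonite_webscraper/manage.py migrate --noinput
    && python /teonite_webscraper/manage.py runserver 0.0.0.0:8080"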
Dockerfile:
# Use an official Python runtime as a parent image
FROM python:3.7
# Set environment variables
ENV PYTHONDONTWRITEBYTECODE 1
ENV PYTHONUNBUFFERED 1
# Set the working directory
WORKDIR /teonite_webscraper
# Copy the current directory contents into the container
COPY . /teonite_webscraper
# Install any needed packages specified in requirements.txt
RUN pip install --trusted-host pypi.python.org -r requirements.txt
# Make port 80 available to the world outside this container
EXPOSE 80
The code that runs at the initialization stage is located in apps.py inside the Django app folder, within the ready() function, like so:
from django.apps import AppConfig


class ScraperConfig(AppConfig):
    name = 'scraper'

    def ready(self):
        import requests
        from bs4 import BeautifulSoup
        from .helpers import get_links
        from .models import Article, Author
        import json
        import re

        # For the implementation check helpers.py; grabs all the article links from the blog
        links = get_links('https://teonite.com/blog/')

        # List of objects to batch inject into the DB to save I/Os
        objects_to_inject = []

        links_in_db = list(Article.objects.all().values_list('article_link', flat=True))
        authors_in_db = list(Author.objects.all().values_list('author_stub', flat=True))

        for link in links:
            if link not in links_in_db:
                # Grab article page
                blog_post = requests.get(link)
                # Prepare soup
                soup = BeautifulSoup(blog_post.content, 'lxml')
                # Gets the json with author data from the page meta
                json_element = json.loads(soup.find_all('script')[1].get_text())

                # All of the below could be done within Article() as parameters, but for clarity
                # I prefer separate lines, and DB models cannot be accessed outside
                # ready() at this stage anyway, so refactoring into a separate function wouldn't be possible
                post_data = Article()
                post_data.article_link = link
                post_data.article_content = soup.find('section', class_='post-content').get_text()

                # Regex only grabs the last part of the author's URL that contains the "nickname"
                author_stub = re.search(r'\/(\w+\-?_?\.?\w+)\/$', json_element['author']['url']).group(1)

                # Check if the author is already in the DB; if so, assign the key.
                if author_stub in authors_in_db:
                    post_data.article_author = Author.objects.get(author_stub=author_stub)
                else:
                    # If not, create a new DB Author item and then assign.
                    new_author = Author(author_fullname=json_element['author']['name'],
                                        author_stub=author_stub)
                    new_author.save()
                    # Unlike links, which are unique, an author might appear many times and we only
                    # grab them from the DB once at the beginning, so add it here to the checklist
                    # to avoid trying to add the same author multiple times
                    authors_in_db.append(author_stub)
                    post_data.article_author = new_author

                post_data.article_title = json_element['headline']
                # Append object to the list and continue
                objects_to_inject.append(post_data)

        Article.objects.bulk_create(objects_to_inject)
I am aware it is not best practice to access the DB in ready(), but I have no idea how else to make this code run when the Django app has started without wiring it to a view (it cannot be wired to a view due to the specs).
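One alternative I can imagine, though I have not tried it, is moving this logic into a custom management command so it could be chained after migrate in the container's start command. A minimal sketch, with the path and names following Django's management command convention as I understand it:

# scraper/management/commands/scrape.py -- hypothetical alternative, not what I currently run
from django.core.management.base import BaseCommand

class Command(BaseCommand):
    help = 'Scrape the blog and store new articles'

    def handle(self, *args, **options):
        # Models are safe to import and query here: this runs after django.setup()
        from scraper.models import Article, Author
        # ... same scraping logic as in ready() above ...

It would then be invoked as python manage.py scrape, but the spec requires the scraping to happen automatically at startup, so I am not sure this fits.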
This is the log I get after trying to run docker-compose up:
db_1 | 2018-10-12 11:46:55.928 UTC [1] LOG: listening on IPv4 address "0.0.0.0", port 5432
db_1 | 2018-10-12 11:46:55.928 UTC [1] LOG: listening on IPv6 address "::", port 5432
db_1 | 2018-10-12 11:46:55.933 UTC [1] LOG: listening on Unix socket "/var/run/postgresql/.s.PGSQL.5432"
db_1 | 2018-10-12 11:46:55.955 UTC [19] LOG: database system was interrupted; last known up at 2018-10-12 11:40:40 UTC
db_1 | 2018-10-12 11:46:56.159 UTC [19] LOG: database system was not properly shut down; automatic recovery in progress
db_1 | 2018-10-12 11:46:56.161 UTC [19] LOG: redo starts at 0/15C0320
db_1 | 2018-10-12 11:46:56.161 UTC [19] LOG: invalid record length at 0/15C0358: wanted 24, got 0
db_1 | 2018-10-12 11:46:56.161 UTC [19] LOG: redo done at 0/15C0320
db_1 | 2018-10-12 11:46:56.172 UTC [1] LOG: database system is ready to accept connections
db_1 | 2018-10-12 11:48:06.831 UTC [26] ERROR: relation "scraper_article" does not exist at character 46
db_1 | 2018-10-12 11:48:06.831 UTC [26] STATEMENT: SELECT "scraper_article"."article_link" FROM "scraper_article"
db_1 | 2018-10-12 11:48:10.649 UTC [27] ERROR: relation "scraper_article" does not exist at character 46
db_1 | 2018-10-12 11:48:10.649 UTC [27] STATEMENT: SELECT "scraper_article"."article_link" FROM "scraper_article"
db_1 | 2018-10-12 11:48:36.193 UTC [28] ERROR: relation "scraper_article" does not exist at character 46
db_1 | 2018-10-12 11:48:36.193 UTC [28] STATEMENT: SELECT "scraper_article"."article_link" FROM "scraper_article"
db_1 | 2018-10-12 11:48:39.820 UTC [29] ERROR: relation "scraper_article" does not exist at character 46
db_1 | 2018-10-12 11:48:39.820 UTC [29] STATEMENT: SELECT "scraper_article"."article_link" FROM "scraper_article"
web_1 | /usr/local/lib/python3.7/site-packages/psycopg2/__init__.py:144: UserWarning: The psycopg2 wheel package will be renamed from release 2.8; in order to keep installing from binary please use "pip install psycopg2-binary" instead. For details see: <http://initd.org/psycopg/docs/install.html#binary-install-from-pypi>.
web_1 | """)
db_1 | 2018-10-12 12:02:03.474 UTC [44] ERROR: relation "scraper_article" does not exist at character 46
db_1 | 2018-10-12 12:02:03.474 UTC [44] STATEMENT: SELECT "scraper_article"."article_link" FROM "scraper_article"
web_1 | /usr/local/lib/python3.7/site-packages/psycopg2/__init__.py:144: UserWarning: The psycopg2 wheel package will be renamed from release 2.8; in order to keep installing from binary please use "pip install psycopg2-binary" instead. For details see: <http://initd.org/psycopg/docs/install.html#binary-install-from-pypi>.
web_1 | """)
db_1 | 2018-10-12 12:02:07.084 UTC [45] ERROR: relation "scraper_article" does not exist at character 46
db_1 | 2018-10-12 12:02:07.084 UTC [45] STATEMENT: SELECT "scraper_article"."article_link" FROM "scraper_article"
web_1 | Unhandled exception in thread started by <function check_errors.<locals>.wrapper at 0x7fb5e5ac6e18>
web_1 | Traceback (most recent call last):
web_1 | File "/usr/local/lib/python3.7/site-packages/django/db/backends/utils.py", line 85, in _execute
web_1 | return self.cursor.execute(sql, params)
web_1 | psycopg2.ProgrammingError: relation "scraper_article" does not exist
web_1 | LINE 1: SELECT "scraper_article"."article_link" FROM "scraper_articl...
web_1 | ^
web_1 |
web_1 |
web_1 | The above exception was the direct cause of the following exception:
web_1 |
web_1 | Traceback (most recent call last):
web_1 | File "/usr/local/lib/python3.7/site-packages/django/utils/autoreload.py", line 225, in wrapper
web_1 | fn(*args, **kwargs)
web_1 | File "/usr/local/lib/python3.7/site-packages/django/core/management/commands/runserver.py", line 109, in inner_run
web_1 | autoreload.raise_last_exception()
web_1 | File "/usr/local/lib/python3.7/site-packages/django/utils/autoreload.py", line 248, in raise_last_exception
web_1 | raise _exception[1]
web_1 | File "/usr/local/lib/python3.7/site-packages/django/core/management/__init__.py", line 337, in execute
web_1 | autoreload.check_errors(django.setup)()
web_1 | File "/usr/local/lib/python3.7/site-packages/django/utils/autoreload.py", line 225, in wrapper
web_1 | fn(*args, **kwargs)
web_1 | File "/usr/local/lib/python3.7/site-packages/django/__init__.py", line 24, in setup
web_1 | apps.populate(settings.INSTALLED_APPS)
web_1 | File "/usr/local/lib/python3.7/site-packages/django/apps/registry.py", line 120, in populate
web_1 | app_config.ready()
web_1 | File "/teonite_webscraper/scraper/apps.py", line 19, in ready
web_1 | links_in_db = list(Article.objects.all().values_list('article_link', flat=True))
web_1 | File "/usr/local/lib/python3.7/site-packages/django/db/models/query.py", line 268, in __iter__
web_1 | self._fetch_all()
web_1 | File "/usr/local/lib/python3.7/site-packages/django/db/models/query.py", line 1186, in _fetch_all
web_1 | self._result_cache = list(self._iterable_class(self))
web_1 | File "/usr/local/lib/python3.7/site-packages/django/db/models/query.py", line 176, in __iter__
web_1 | for row in compiler.results_iter(chunked_fetch=self.chunked_fetch, chunk_size=self.chunk_size):
web_1 | File "/usr/local/lib/python3.7/site-packages/django/db/models/sql/compiler.py", line 1017, in results_iter
web_1 | results = self.execute_sql(MULTI, chunked_fetch=chunked_fetch, chunk_size=chunk_size)
web_1 | File "/usr/local/lib/python3.7/site-packages/django/db/models/sql/compiler.py", line 1065, in execute_sql
web_1 | cursor.execute(sql, params)
web_1 | File "/usr/local/lib/python3.7/site-packages/django/db/backends/utils.py", line 100, in execute
web_1 | return super().execute(sql, params)
web_1 | File "/usr/local/lib/python3.7/site-packages/django/db/backends/utils.py", line 68, in execute
web_1 | return self._execute_with_wrappers(sql, params, many=False, executor=self._execute)
web_1 | File "/usr/local/lib/python3.7/site-packages/django/db/backends/utils.py", line 77, in _execute_with_wrappers
web_1 | return executor(sql, params, many, context)
web_1 | File "/usr/local/lib/python3.7/site-packages/django/db/backends/utils.py", line 85, in _execute
web_1 | return self.cursor.execute(sql, params)
web_1 | File "/usr/local/lib/python3.7/site-packages/django/db/utils.py", line 89, in __exit__
web_1 | raise dj_exc_value.with_traceback(traceback) from exc_value
web_1 | File "/usr/local/lib/python3.7/site-packages/django/db/backends/utils.py", line 85, in _execute
web_1 | return self.cursor.execute(sql, params)
web_1 | django.db.utils.ProgrammingError: relation "scraper_article" does not exist
web_1 | LINE 1: SELECT "scraper_article"."article_link" FROM "scraper_articl...
I have tried using an entrypoint, but ended up getting errors saying that the file does not exist. Using an additional service that depends on db, builds the image, runs migrate, and starts before the web server also did not work; the web service exited with code 0.
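For reference, the entrypoint approach I was aiming for looked roughly like this (a sketch of the idea, not my exact files; I suspect the "file does not exist" error came from the path or a missing executable bit):

entrypoint.sh (next to manage.py):

#!/bin/sh
# Run migrations first, then hand over to whatever command compose passes in
python /teonite_webscraper/manage.py migrate --noinput
exec "$@"

with these lines added to the Dockerfile:

COPY entrypoint.sh /teonite_webscraper/entrypoint.sh
RUN chmod +x /teonite_webscraper/entrypoint.sh
ENTRYPOINT ["/teonite_webscraper/entrypoint.sh"]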