I am creating a Django app that should run web-scraping code as soon as it starts and then serve the scraped data on request via a REST API. It must run on Docker, which is causing the following problem: with docker-compose up the image builds properly and the db service starts, but then I get an error saying that the relations in my DB do not exist. I can fix this by running docker-compose run [service] manage.py migrate, but that is a manual step and won't work when someone clones the app from git and tries to run it with docker-compose up.

I have added the command python /teonite_webscraper/manage.py migrate --noinput to my docker-compose.yml, but it does not seem to run.

docker-compose.yml:

version: '3.6'

services:
  db:
    image: postgres:10.1-alpine
    volumes:
      - postgres_data:/var/lib/postgresql/data/
  web:
    build: .
    command: python /teonite_webscraper/manage.py migrate --noinput
    command: python /teonite_webscraper/manage.py runserver 0.0.0.0:8080
    volumes:
      - .:/teonite_webscraper
    ports:
      - 8080:8080
    environment:
      - SECRET_KEY=changemeinprod
    depends_on:
      - db

volumes:
  postgres_data:

Dockerfile:

# Use an official Python runtime as a parent image
FROM python:3.7

# Set environment variables
ENV PYTHONDONTWRITEBYTECODE 1
ENV PYTHONUNBUFFERED 1

# Set the working directory
WORKDIR /teonite_webscraper

# Copy the current directory contents into the container
COPY . /teonite_webscraper

# Install any needed packages specified in requirements.txt
RUN pip install --trusted-host pypi.python.org -r requirements.txt

# Make port 80 available to the world outside this container
EXPOSE 80

The code that runs at initialization is located in apps.py inside the Django app folder, in the ready() method:

from django.apps import AppConfig

class ScraperConfig(AppConfig):
    name = 'scraper'

    def ready(self):
        import requests
        from bs4 import BeautifulSoup
        from .helpers import get_links
        from .models import Article, Author
        import json
        import re

        # For implementation check helpers.py, grabs all the article links from blog
        links = get_links('https://teonite.com/blog/')
        # List of objects to batch inject into DB to save I/Os
        objects_to_inject = []

        links_in_db = list(Article.objects.all().values_list('article_link', flat=True))
        authors_in_db = list(Author.objects.all().values_list('author_stub', flat=True))

        for link in links:

            if link not in links_in_db:
                # Grab article page
                blog_post = requests.get(link)
                # Prepare soup
                soup = BeautifulSoup(blog_post.content, 'lxml')
                # Gets the json with author data from page meta
                json_element = json.loads(soup.find_all('script')[1].get_text())

                # All of the below can be done within Articles() as parameters, but for clarity
                # I prefer separate lines, and DB models cannot be accessed outside
                # ready() at this stage anyway so refactoring to separate function wouldn't be possible
                post_data = Article()
                post_data.article_link = link
                post_data.article_content = soup.find('section', class_='post-content').get_text()

                # Regex only grabs the last part of author's URL that contains the "nickname"
                author_stub = re.search(r'\/(\w+\-?_?\.?\w+)\/$', json_element['author']['url']).group(1)

                # Check if author is already in DB if so assign the key.
                if author_stub in authors_in_db:
                    post_data.article_author = Author.objects.get(author_stub=author_stub)
                else:
                    # If not, create new DB Authors item and then assign.
                    new_author = Author(author_fullname=json_element['author']['name'],
                                         author_stub=author_stub)
                    new_author.save()
                    # Unlike links which are unique, author might appear many times and we only grab
                    # them from DB once at the beginning, so adding it here to the checklist to avoid trying to
                    # add same author multiple times
                    authors_in_db.append(author_stub)
                    post_data.article_author = new_author

                post_data.article_title = json_element['headline']
                # Append object to the list and continue
                objects_to_inject.append(post_data)

        Article.objects.bulk_create(objects_to_inject)

I am aware that accessing the DB in ready() is not best practice, but I have no idea how else to make this code run once the Django app has started without wiring it to a view (which the specs do not allow).

This is the log I get after trying to run docker-compose up:

db_1   | 2018-10-12 11:46:55.928 UTC [1] LOG:  listening on IPv4 address "0.0.0.0", port 5432
db_1   | 2018-10-12 11:46:55.928 UTC [1] LOG:  listening on IPv6 address "::", port 5432
db_1   | 2018-10-12 11:46:55.933 UTC [1] LOG:  listening on Unix socket "/var/run/postgresql/.s.PGSQL.5432"
db_1   | 2018-10-12 11:46:55.955 UTC [19] LOG:  database system was interrupted; last known up at 2018-10-12 11:40:40 UTC
db_1   | 2018-10-12 11:46:56.159 UTC [19] LOG:  database system was not properly shut down; automatic recovery in progress
db_1   | 2018-10-12 11:46:56.161 UTC [19] LOG:  redo starts at 0/15C0320
db_1   | 2018-10-12 11:46:56.161 UTC [19] LOG:  invalid record length at 0/15C0358: wanted 24, got 0
db_1   | 2018-10-12 11:46:56.161 UTC [19] LOG:  redo done at 0/15C0320
db_1   | 2018-10-12 11:46:56.172 UTC [1] LOG:  database system is ready to accept connections
db_1   | 2018-10-12 11:48:06.831 UTC [26] ERROR:  relation "scraper_article" does not exist at character 46
db_1   | 2018-10-12 11:48:06.831 UTC [26] STATEMENT:  SELECT "scraper_article"."article_link" FROM "scraper_article"
db_1   | 2018-10-12 11:48:10.649 UTC [27] ERROR:  relation "scraper_article" does not exist at character 46
db_1   | 2018-10-12 11:48:10.649 UTC [27] STATEMENT:  SELECT "scraper_article"."article_link" FROM "scraper_article"
db_1   | 2018-10-12 11:48:36.193 UTC [28] ERROR:  relation "scraper_article" does not exist at character 46
db_1   | 2018-10-12 11:48:36.193 UTC [28] STATEMENT:  SELECT "scraper_article"."article_link" FROM "scraper_article"
db_1   | 2018-10-12 11:48:39.820 UTC [29] ERROR:  relation "scraper_article" does not exist at character 46
db_1   | 2018-10-12 11:48:39.820 UTC [29] STATEMENT:  SELECT "scraper_article"."article_link" FROM "scraper_article"
web_1  | /usr/local/lib/python3.7/site-packages/psycopg2/__init__.py:144: UserWarning: The psycopg2 wheel package will be renamed from release 2.8; in order to keep installing from binary please use "pip install psycopg2-binary" instead. For details see: <http://initd.org/psycopg/docs/install.html#binary-install-from-pypi>.
web_1  |   """)
db_1   | 2018-10-12 12:02:03.474 UTC [44] ERROR:  relation "scraper_article" does not exist at character 46
db_1   | 2018-10-12 12:02:03.474 UTC [44] STATEMENT:  SELECT "scraper_article"."article_link" FROM "scraper_article"
web_1  | /usr/local/lib/python3.7/site-packages/psycopg2/__init__.py:144: UserWarning: The psycopg2 wheel package will be renamed from release 2.8; in order to keep installing from binary please use "pip install psycopg2-binary" instead. For details see: <http://initd.org/psycopg/docs/install.html#binary-install-from-pypi>.
web_1  |   """)
db_1   | 2018-10-12 12:02:07.084 UTC [45] ERROR:  relation "scraper_article" does not exist at character 46
db_1   | 2018-10-12 12:02:07.084 UTC [45] STATEMENT:  SELECT "scraper_article"."article_link" FROM "scraper_article"
web_1  | Unhandled exception in thread started by <function check_errors.<locals>.wrapper at 0x7fb5e5ac6e18>
web_1  | Traceback (most recent call last):
web_1  |   File "/usr/local/lib/python3.7/site-packages/django/db/backends/utils.py", line 85, in _execute
web_1  |     return self.cursor.execute(sql, params)
web_1  | psycopg2.ProgrammingError: relation "scraper_article" does not exist
web_1  | LINE 1: SELECT "scraper_article"."article_link" FROM "scraper_articl...
web_1  |                                                      ^
web_1  | 
web_1  | 
web_1  | The above exception was the direct cause of the following exception:
web_1  | 
web_1  | Traceback (most recent call last):
web_1  |   File "/usr/local/lib/python3.7/site-packages/django/utils/autoreload.py", line 225, in wrapper
web_1  |     fn(*args, **kwargs)
web_1  |   File "/usr/local/lib/python3.7/site-packages/django/core/management/commands/runserver.py", line 109, in inner_run
web_1  |     autoreload.raise_last_exception()
web_1  |   File "/usr/local/lib/python3.7/site-packages/django/utils/autoreload.py", line 248, in raise_last_exception
web_1  |     raise _exception[1]
web_1  |   File "/usr/local/lib/python3.7/site-packages/django/core/management/__init__.py", line 337, in execute
web_1  |     autoreload.check_errors(django.setup)()
web_1  |   File "/usr/local/lib/python3.7/site-packages/django/utils/autoreload.py", line 225, in wrapper
web_1  |     fn(*args, **kwargs)
web_1  |   File "/usr/local/lib/python3.7/site-packages/django/__init__.py", line 24, in setup
web_1  |     apps.populate(settings.INSTALLED_APPS)
web_1  |   File "/usr/local/lib/python3.7/site-packages/django/apps/registry.py", line 120, in populate
web_1  |     app_config.ready()
web_1  |   File "/teonite_webscraper/scraper/apps.py", line 19, in ready
web_1  |     links_in_db = list(Article.objects.all().values_list('article_link', flat=True))
web_1  |   File "/usr/local/lib/python3.7/site-packages/django/db/models/query.py", line 268, in __iter__
web_1  |     self._fetch_all()
web_1  |   File "/usr/local/lib/python3.7/site-packages/django/db/models/query.py", line 1186, in _fetch_all
web_1  |     self._result_cache = list(self._iterable_class(self))
web_1  |   File "/usr/local/lib/python3.7/site-packages/django/db/models/query.py", line 176, in __iter__
web_1  |     for row in compiler.results_iter(chunked_fetch=self.chunked_fetch, chunk_size=self.chunk_size):
web_1  |   File "/usr/local/lib/python3.7/site-packages/django/db/models/sql/compiler.py", line 1017, in results_iter
web_1  |     results = self.execute_sql(MULTI, chunked_fetch=chunked_fetch, chunk_size=chunk_size)
web_1  |   File "/usr/local/lib/python3.7/site-packages/django/db/models/sql/compiler.py", line 1065, in execute_sql
web_1  |     cursor.execute(sql, params)
web_1  |   File "/usr/local/lib/python3.7/site-packages/django/db/backends/utils.py", line 100, in execute
web_1  |     return super().execute(sql, params)
web_1  |   File "/usr/local/lib/python3.7/site-packages/django/db/backends/utils.py", line 68, in execute
web_1  |     return self._execute_with_wrappers(sql, params, many=False, executor=self._execute)
web_1  |   File "/usr/local/lib/python3.7/site-packages/django/db/backends/utils.py", line 77, in _execute_with_wrappers
web_1  |     return executor(sql, params, many, context)
web_1  |   File "/usr/local/lib/python3.7/site-packages/django/db/backends/utils.py", line 85, in _execute
web_1  |     return self.cursor.execute(sql, params)
web_1  |   File "/usr/local/lib/python3.7/site-packages/django/db/utils.py", line 89, in __exit__
web_1  |     raise dj_exc_value.with_traceback(traceback) from exc_value
web_1  |   File "/usr/local/lib/python3.7/site-packages/django/db/backends/utils.py", line 85, in _execute
web_1  |     return self.cursor.execute(sql, params)
web_1  | django.db.utils.ProgrammingError: relation "scraper_article" does not exist
web_1  | LINE 1: SELECT "scraper_article"."article_link" FROM "scraper_articl...

I have tried using an entrypoint but ended up getting errors saying that the file does not exist. I also tried an additional service that depends on db, builds the image, and runs migrate before the web server starts. That did not work either: the web service exited with code 0.


Solved (solution below in answers)

  • You can't use duplicate keys in YAML. At least when converted to python (docker-compose is Python) the first `command` will be overwritten. Make it one command, add a script or run the migration by attaching to the container: `docker exec -ti container bash`. – Klaus D. Oct 12 '18 at 12:22
  • Please move your solution to an answer of its own, thank you. – Cœur Dec 31 '18 at 16:42

2 Answers

0

How did you use entrypoint.sh?

Like that?

entrypoint.sh:

#!/bin/sh
python manage.py makemigrations
python manage.py migrate
exec "$@"

docker-compose.yml (under 'web'):

entrypoint: /entrypoint.sh

If this doesn't work, try this in docker-compose.yml (under 'web'):

command: python /teonite_webscraper/manage.py migrate --noinput && python /teonite_webscraper/manage.py runserver 0.0.0.0:8080
Kamil Niski
  • when I use `entrypoint.sh` like this this is what I get: `ERROR: for teonite_webscraper_web_1 Cannot start service web: OCI runtime create failed: container_linux.go:348: starting container process caused "exec: \"/entrypoint.sh\": stat /entrypoint.sh: no such file or directory": unknown ERROR: for web Cannot start service web: OCI runtime create failed: container_linux.go:348: starting container process caused "exec: \"/entrypoint.sh\": stat /entrypoint.sh: no such file or directory": unknown` single line `command` does not work either :/ – Bartek Gańcza Oct 12 '18 at 12:41
  • @BartekGańcza did you put `entrypoint.sh` in the same directory as `docker-compose.yml`? – Kamil Niski Oct 12 '18 at 12:43
  • Yup, they are in the same dir and in Dockerfile I do: `WORKDIR /teonite_webscraper COPY . /teonite_webscraper` – Bartek Gańcza Oct 12 '18 at 12:47
  • So did you try the other solution at the bottom of my answer? – Kamil Niski Oct 12 '18 at 12:49
  • Last try with `entrypoint.sh`: remove `#!/bin/sh` from the file as this answer suggests: https://stackoverflow.com/a/38905412/9820085 – Kamil Niski Oct 12 '18 at 12:52
  • The shell script does not work no matter what I try. The one-line command worked, but I had to write it as `command: bash -c "python /teonite_webscraper/manage.py migrate && python /teonite_webscraper/manage.py runserver 0.0.0.0:8080"`. I also found what is causing the real problem: the migrations only run when I remove all the code from `apps.py` that tries to access the DB. It seems that when Django runs migrations, it also runs the `ready()` function, so I need a way to move this code out of `ready()` and run it after the Django app is fully operational. – Bartek Gańcza Oct 12 '18 at 13:21
  • Ok I have found out how to solve the problem, thank you, you actually helped me get onto the real problem here :) – Bartek Gańcza Oct 12 '18 at 14:02
  • Please write the solution for people who will be searching it in the future. – Kamil Niski Oct 12 '18 at 14:04
  • Already did in the top of my question post :) Should I put it as an Answer here? (sorry I'm quite new to programming and more so StackOverflow) – Bartek Gańcza Oct 12 '18 at 14:47
0

I have found out what was causing the real problem here.

It turns out that Django initializes every installed app, including calling each AppConfig's ready(), even when it is only running a management command such as manage.py migrate (the traceback in the question shows django.setup() calling apps.populate(), which calls app_config.ready()). This meant that the code I had put in ready() was executed and tried to query tables that had not been created yet, which prevented the migration from actually running. The solution was to enclose the entire code in an if statement like so:

import sys

if 'migrate' not in sys.argv:
   [...]

and also changing command: in docker-compose.yml to a single line argument like so:

command: bash -c "python /teonite_webscraper/manage.py migrate && python /teonite_webscraper/manage.py runserver 0.0.0.0:8080"

to avoid any potential problems with multiple identical keys in the .yml file.
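
Checking for 'migrate' alone still lets ready() hit the database during other management commands (makemigrations, for instance), which also initialize the apps and may run before the tables exist. Below is a minimal sketch of the opposite approach, assuming the app is always started through manage.py runserver as in the compose file above: run the scrape only when the subcommand is on an allowlist. The names SERVER_COMMANDS and should_run_startup_scrape are hypothetical and not part of Django.

```python
import sys

# Hypothetical allowlist: manage.py subcommands that should trigger the scrape.
SERVER_COMMANDS = {"runserver"}


def should_run_startup_scrape(argv):
    """Return True only when manage.py was invoked with a server subcommand.

    argv[0] is the script path (e.g. /teonite_webscraper/manage.py) and
    argv[1], when present, is the subcommand such as 'migrate' or 'runserver'.
    """
    return len(argv) > 1 and argv[1] in SERVER_COMMANDS


# Intended use inside ScraperConfig.ready() (shown as a comment so that this
# snippet stays runnable without Django installed):
#
#     if should_run_startup_scrape(sys.argv):
#         ...  # scraping code from the question
```

Note that this sketch would skip the scrape entirely under a different server such as gunicorn, because 'runserver' never appears in sys.argv there, so the allowlist would need to be adapted to however the app is actually launched.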