1

I am looping through my database and updating all my Company objects.

for company in Company.objects.filter(updated=False):
    driver.get(company.company_url)
    company.adress = driver.find_element_by_id("address").text
    company.visited = True    
    company.save()

My problem is that it's taking too long so I wanted to run another instance of this same code, but I'm curious when the actual db reads happen. If company.visited get's changed to True while this loop is running, will still be visited by this loop? What if I added a second check for visited? I don't want to start a second loop if the first instance isn't going to recognize the work of the second instance:

for company in Company.objects.filter(updated=False):
    if company.visited:
        continue
    driver.get(company.company_url)
    company.adress = driver.find_element_by_id("address").text
    company.visited = True    
    company.save()
Max Malysh
  • 29,384
  • 19
  • 111
  • 115
Chase Roberts
  • 9,082
  • 13
  • 73
  • 131

1 Answers1

1

Company.objects.filter(updated=False) translates to an ordinary SQL query:

SELECT * FROM appName_company WHERE updated is false

This SQL query is executed when you start iterating through Company objects. It's executed only once. The second server will not recognize the work of the first one, because they both will go through the same Company objects.

Lock rows to avoid race conditions using atomic transactions and select_for_update():

from django.db import transaction

for company in Company.objects.filter(updated=False):
    with transaction.atomic():
        Company.objects.select_for_update().get(id=company.id)
        if company.visited:
            continue
        driver.get(company.company_url)
        company.adress = driver.find_element_by_id("address").text
        company.visited = True    
        company.save()

You can run this code on multiple servers. Each Company will be processed just once.

If you need to execute this code regularly, I highly recommend using Celery. Dispatch a task per each company, and let multiple workers do the work in parallel:

from celery import shared_task

@shared_task
def dispatch_tasks():
    for company in Company.objects.filter(updated=False):
        process_company.delay(company.id)

@shared_task
@transaction.atomic
def process_company(company_id):
    company = Company.objects.select_for_update().get(id=company_id)
    if company.visited:
        continue
    driver.get(company.company_url)
    company.adress = driver.find_element_by_id("address").text
    company.visited = True    
    company.save()

Edit: oh, I see that you've tagged the question with the sqlite tag. I recommend switching to PostgreSQL, as SQLite is really bad at concurrency. My answer should work with SQlite, but locks may slow down the database.

Max Malysh
  • 29,384
  • 19
  • 111
  • 115
  • `select_for_update()`is ignored on SQLite: https://docs.djangoproject.com/en/2.1/ref/databases/#queryset-select-for-update-not-supported See also: https://stackoverflow.com/a/46949541/1097104 – Juuso Ohtonen Oct 21 '18 at 07:05