I use Jupyter Notebook to play with the data that I store in django/postgres. I initialize my project this way:
import os
import sys
import django

sys.path.append('/srv/gr/prg')
os.environ.setdefault('DJANGO_SETTINGS_MODULE', 'prg.settings')
if 'setup' in dir(django):
    django.setup()
There are many individual processes that update the data, and I wanted to multithread the work to speed it up. Everything works well when I do the updates in a single thread or use SQLite.
import logging

from bs4 import BeautifulSoup

# `models` is imported from the Django app that defines Organization

def extract_org_description(id):
    o = models.Organization.objects.get(pk=id)
    logging.info("Looking for description for %s" % o.symbol)
    try:
        with open('/srv/data/%s.html' % o.symbol) as content:
            doc = BeautifulSoup(content, 'html.parser')
    except FileNotFoundError:
        logging.error("HTML file not found for %s" % o.symbol)
        return
    desc = doc.select("#cr_description_mod > div.cr_expandBox > div.cr_description_full.cr_expand")
    if not desc or not desc[0]:
        logging.info("Cannot find description for %s" % o.symbol)
        return
    o.description = desc[0].text
    o.save(update_fields=['description'])
    logging.info("Description for %s found" % o.symbol)
    return "done %s" % id
And this will not work:
from multiprocessing import Pool

p = Pool(2)
result = p.map(extract_org_description, orgs)  # orgs: list of Organization pks
print(result)
Most of the time it hangs until I interrupt it, without any particular error; sometimes Postgres reports "There is already a transaction in progress", and sometimes I see a "No Results to fetch" error. By playing with the pool size I could make it work maybe once or twice, but it's hard to diagnose what exactly the issue is.
I tried changing the strategy: selecting the objects first and mapping a version of extract_org_description that takes the object itself as the parameter (instead of selecting by primary key inside the worker), but this does not work any better; a sketch of that variant is below.
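Roughly, that variant looked like this (extract_org_description_from_obj is a hypothetical name for the object-taking version of the function above):

def extract_org_description_from_obj(o):
    # same body as extract_org_description, minus the .get(pk=id) lookup
    ...

orgs = list(models.Organization.objects.all())
p = Pool(2)
result = p.map(extract_org_description_from_obj, orgs)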
The only thought I have is that when Django is trying to autocommit, all of the individual updates, including the ones happening in the other workers, end up in the same transaction scope, and this is causing the issue. But I don't understand how to fix this in Django.
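If that hypothesis is right, would forcing every worker to open its own database connection help? A sketch of what I have in mind, assuming that closing the parent's connections before forking makes each worker reconnect on its first query:

from multiprocessing import Pool

from django import db

# Close the inherited connection in the parent before forking, so each
# worker opens a fresh connection of its own on its first query.
db.connections.close_all()

p = Pool(2)
result = p.map(extract_org_description, orgs)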