Optimizing Django queryset calls, with large lists?

Question

I will try to explain my problem as best I can. I'm having performance issues, and I wonder if there's a better way of setting this up.

I have a database with roughly 150 years worth of year-by-year data. Each row has about 10 columns.

I am running a "simulation" on every 30yr span (we'll call each 30yr chunk a "Cycle"). So, Cycle 1 would be years 1-31. Cycle 2 would be years 2-32. Cycle 3, years 3-33. Get the idea?

all_data = DataPoint.objects.all().yearly()
cycle_length = 30
previous_data_points = []
for data_point in all_data:
    if len(previous_data_points) < cycle_length:
        previous_data_points.append(data_point)
        continue
    simulation_cycles.append(Cycle(previous_data_points))
    if len(previous_data_points) == cycle_length:
        previous_data_points.pop(0)
    previous_data_points.append(data_point)

So, for each 30 year Cycle, I feed the Cycle function the 30 queryset items to initialize data. Problem is, when I use Django's connection.queries to list out what's going on, it looks like it's doing 3000+ queries and is taking 10-12 seconds, which is pretty long for what it's doing.

In the connection.queries list, I see that it's doing 30 separate calls (1 for each data point) when I pass it into Cycle (each one using "WHERE EXTRACT(MONTH FROM" which I believe is my .yearly() filter being called. But then it also is logging a query inside of Cycle() where it actually looks up the data point by date.

When I pass along parts of querysets like that, is the data being cached? Is my app really running 3000 queries (using connections on the db) or is it 1 big query in my DataPoint.objects.all().yearly() call, and everything else is in memory?

I'm trying to understand why this is running so slowly. Part of the problem, I assume, is that I'm creating this huge list of objects: 120 "Cycles", each with 30 individual "Year" objects that have data of their own (and calculations to make that data) for later. Is having that many objects in memory going to hurt me, or is that small potatoes?

Edit:

class Cycle:
def __init__(self, data_points):
    self.range_start = data_points[0].data_date
    self.range_end = data_points[-1].data_date
    self.start_CPI = data_points[0].cpi
    self.years = relativedelta(self.range_end, self.range_start).years + 1
    self.sim = []

    for i in range (0, self.years):
        data_point = data_points[i]
        self.sim.append(Segment(
            date=self.range_start + relativedelta(years=i),
            start_CPI=self.start_CPI,
            yearly_equities_growth=data_point.yearly_equities_growth,
            cpi=data_point.cpi,
            dividend=data_point.dividend,
            s_and_p_composite=data_point.s_and_p_composite,
            long_interest_rate=data_point.long_interest_rate
        ))

class Segment:
def __init__(self, date, start_CPI, yearly_equities_growth, cpi, dividend, s_and_p_composite, long_interest_rate):
    self.start_CPI = D(start_CPI)
    self.date = date
    self.portfolio = {
        "start": None,
        "end": None,
        "inflation_adjusted_start": None,
        "inflation_adjusted_end": None,
        "fees": None
    }
    self.spending = None
    self.inflation_adjusted_spending = None
    self.equities = {
        "start": None,
        "growth": None,
        "val": None
    }
    self.bonds = {
        "start": None,
        "growth": None,
        "val": None
    }
    self.gold = {
        "start": None,
        "growth": None,
        "val": None
    }
    self.cash = {
        "start": None,
        "growth": None,
        "val": None
    }
    self.dividends = {
        "growth": None,
        "val": None
    }
    self.fees = None

    self.yearly_equities_growth = D(yearly_equities_growth)
    self.cumulative_inflation = 1 + (D(cpi) - self.start_CPI) / self.start_CPI
    self.sum_of_adjustments = None
    self.cpi = cpi
    self.dividend = dividend
    self.s_and_p_composite = s_and_p_composite
    self.long_interest_rate = long_interest_rate

can you share the code for the Cycle class? to understand your problem better — jperelli, Mar 18 '17 at 01:49
done. Cycle() also calls Segment() so I added that. Each Cycle has 30 Segments. — bo_knows, Mar 18 '17 at 02:49

score 0 · Answer 1 · answered Mar 18 '17 at 15:03

0

From your code, it seems to me that this part is doing a O(n²) algorithm

for i in range (0, self.years):

If any code running inside that for needs to go to database, that is the problem.

Check if any of this fields is hitting the database. If maybe any of those is a related field, you need to prefetch_related or select_related. Or maybe any of those is a property that needs to hit the database, chech for that.

data_point.yearly_equities_growth,
data_point.cpi,
data_point.dividend,
data_point.s_and_p_composite,
data_point.long_interest_rate

Also, what is the D() function doing? It might be forcing a query to database, check that.

answered Mar 18 '17 at 15:03

jperelli

6,988
5
50
85

Those properties DO hit the DB, but I *assumed* that when I passed in the `data_point`, that it already had the data. D() is just `from decimal import Decimal as D` – bo_knows Mar 18 '17 at 15:05
I'm not familiar with `prefetch_related or select_related`, I will check them out and report back. – bo_knows Mar 18 '17 at 15:06
prefetch_related and select_related help in cases where you use properties that are relations with other tables – jperelli Mar 18 '17 at 15:13
I see that. This table actually has no relations. It is a pretty simple data store. If in the very beginning, I change `all_data = DataPoint.objects.all().yearly()` to this `cached_data = [] all_data = DataPoint.objects.all().yearly() for a in all_data: cached_data.append(a)` Then I can use cached_data[] to populate the Cycle() and Segment() objects. I assume that will be more DB efficient? – bo_knows Mar 18 '17 at 15:26
It might be that. Maybe `data_point = data_points[i]` forcing a query – jperelli Mar 18 '17 at 16:09

score 0 · Answer 2 · edited May 23 '17 at 12:09

Iterating over a QuerySet is supposed to query everything on the first iteration and then use the cached result for subsequent iterations (source). I'm not familiar with connection.queries, but

You can use django-debug-toolbar, django-extensions or django silk to sanity-check the 3000+ query hypothesis (source).

@jperelli suggested that `data_point = data_points[i] might be the culprit.

for i in range (0, self.years):
    data_point = data_points[i]

What happens if you cast all_data to a list before passing it into the iterator?

Optimizing Django queryset calls, with large lists?

2 Answers2