I have a large address database of commercial properties (about 5 million rows), of which 200,000 have missing floor areas. The properties are classified by industry, and I know the rent for each.

My approach for interpolating the missing floor areas was to filter for similarly-classified properties within a specified radius of the property with unknown floor area, and then calculate the floor area from the median of the cost/m2 of the nearby properties.
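
To make the rule concrete, here is a minimal sketch of the calculation in plain Python (a hypothetical helper; neighbours stands in for the filtered nearby rows):

from statistics import median

def interpolate_floor_area(rental_valuation, neighbours):
    """Hypothetical helper: estimate a floor area from the median
    cost/m2 of nearby, similarly-classified properties."""
    # cost per m2 of each comparable with a known, non-zero floor area
    ratios = [n["rental_valuation"] / n["floor_area"]
              for n in neighbours if n.get("floor_area")]
    # rent divided by the median rent-per-m2 gives the estimated area
    return rental_valuation / median(ratios)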

Originally, I approached this using pandas, but that has become problematic as the dataset has grown larger (even using groupby). It often exceeds available memory and stops; when it does complete, it takes about 3 hours.

I am testing to see whether I can do the same task in the database. The function I've written for radial fill is as follows:

from statistics import median

from django.contrib.gis.db.models.functions import Distance
from django.db.models import Q

def _radial_fill(self):
    # Initial query selecting all latest locations, and excluding null rental valuations
    q = Location.objects.order_by("locode", "-update_cycle") \
                        .distinct("locode")
    # Chained Q objects to use in filter
    f = Q(rental_valuation__isnull=False) & \
        Q(use_category__grouped_by__isnull=False) & \
        Q(pc__isnull=False)
    # All property categories at subgroup level
    for c in LocationCategory.objects.filter(use_category="SGP"):
        # Start looking for appropriate interpolation locations
        fc = f & Q(use_category__grouped_by=c)
        for l in q.filter(fc & Q(floor_area__isnull=True)):
            r_degree = 0
            while True:
                # Default Distance unit is metres, so multiply accordingly
                r = (constants.BOUNDS**r_degree)*1000 # metres
                ql = q.annotate(distance=Distance("pc__point", l.pc.point)) \
                      .filter(fc & Q(floor_area__isnull=False) & Q(distance__lte=r)) \
                      .values("rental_valuation", "floor_area")
                if len(ql) < constants.LOWER_RANGE:
                    if r > constants.UPPER_RADIUS*1000:
                        # Further than the longest permitted search radius
                        break
                    # Too few comparables: widen the radius and retry
                    r_degree += 1
                else:
                    # Median cost/m2 of the comparables, ignoring zero floor areas
                    m = median([x["rental_valuation"]/x["floor_area"]
                                for x in ql if x["floor_area"] > 0.0])
                    l.floor_area = l.rental_valuation / m
                    l.save()
                    break

My problem is that this function takes 6 days to run. There has to be a faster way, right? I'm sure I'm doing something terribly wrong...

The models are as follows:

class LocationCategory(models.Model):
    # Category types
    GRP = "GRP"
    SGP = "SGP"
    UST = "UST"
    CATEGORIES = (
        (GRP, "Group"),
        (SGP, "Sub-group"),
        (UST, "Use type"),
    )
    slug = models.CharField(max_length=24, primary_key=True, unique=True)
    usecode = models.CharField(max_length=14, db_index=True)
    use_category = models.CharField(max_length=3, choices=CATEGORIES,
                                    db_index=True, default=UST)
    grouped_by = models.ForeignKey("self", null=True, blank=True,
                                   on_delete=models.SET_NULL,
                                   related_name="category_by_group")

class Location(models.Model):
    # Hereditament identity and location
    slug = models.CharField(max_length=24, db_index=True)
    locode = models.CharField(max_length=14, db_index=True)
    pc = models.ForeignKey(Postcode, null=True, blank=True,
                           on_delete=models.SET_NULL,
                           related_name="locations_by_pc")
    use_category = models.ForeignKey(LocationCategory, null=True, blank=True,
                                     on_delete=models.SET_NULL,
                                     related_name="locations_by_category")
    # History fields
    update_cycle = models.CharField(max_length=14, db_index=True)
    # Location-specific econometric data
    floor_area = models.FloatField(blank=True, null=True)
    rental_valuation = models.FloatField(blank=True, null=True)

class Postcode(models.Model):
    pc = models.CharField(max_length=7, primary_key=True, unique=True) # Postcode excl space
    pcs = models.CharField(max_length=8, unique=True)                  # Postcode incl space
    # NB: srid=4326 is WGS 84; British National Grid would be EPSG:27700
    # http://spatialreference.org/ref/epsg/osgb-1936-british-national-grid/
    point = models.PointField(srid=4326)

Using Django 2.0 and PostgreSQL 10.
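
(Side note: GeoDjango's PointField creates a GiST spatial index by default via spatial_index=True, and the distance filters above depend on it. If the table was populated outside Django's migrations it is worth confirming the index exists; a sketch of a manual migration, where "myapp" and the table name myapp_postcode are assumptions:)

# Sketch only: app label and table name are assumptions; GeoDjango
# normally creates this index itself when spatial_index=True (the default)
from django.db import migrations

class Migration(migrations.Migration):
    dependencies = [("myapp", "0001_initial")]
    operations = [
        migrations.RunSQL(
            sql="CREATE INDEX IF NOT EXISTS postcode_point_gist "
                "ON myapp_postcode USING GIST (point);",
            reverse_sql="DROP INDEX IF EXISTS postcode_point_gist;",
        ),
    ]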

UPDATE

I've achieved a 35% improvement in runtime with the following code change:

from statistics import median

from django.contrib.gis.db.models.functions import Distance as D
from django.db.models import F, Q

# Initial query selecting all latest locations, and excluding null rental valuations
q = Location.objects.order_by("slug", "-update_cycle") \
                    .distinct("slug")
# Chained Q objects to use in filter
f = Q(rental_valuation__isnull=False) & \
    Q(pc__isnull=False) & \
    Q(use_category__grouped_by_id=category_id)
# Start looking for appropriate interpolation locations at subgroup level
for l in q.filter(f & Q(floor_area__isnull=True)).iterator():
    # Candidate comparables: same subgroup, known and non-zero floor area
    r = q.filter(f & Q(floor_area__isnull=False) & ~Q(floor_area=0.0))
    # Nearest LOWER_RANGE comparables, annotated with their cost/m2
    rl = Location.objects.filter(id__in=r).annotate(distance=D("pc__point", l.pc.point)) \
                                          .order_by("distance")[:constants.LOWER_RANGE] \
                                          .annotate(floor_ratio=F("rental_valuation") /
                                                                F("floor_area")) \
                                          .values("floor_ratio")
    if len(rl) == constants.LOWER_RANGE:
        m = median([h["floor_ratio"] for h in rl])
        l.floor_area = l.rental_valuation / m
        l.save()

The id__in=r is inefficient, but it seems to be the only way to maintain the distinct queryset when adding and sorting on a new annotation. Given that some 100,000 rows can be returned in the r query, any annotations applied there, with subsequent sorting by distance, can take a hellishly long time.
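
One shape that at least avoids annotating after the slice (an untested sketch, reusing the names above) is to annotate distance and ratio together before ordering, keeping the candidate set as an unevaluated subselect:

candidate_ids = r.values("id")  # compiles to a SQL subselect, not a Python list

rl = Location.objects.filter(id__in=candidate_ids) \
                     .annotate(distance=D("pc__point", l.pc.point),
                               floor_ratio=F("rental_valuation") / F("floor_area")) \
                     .order_by("distance") \
                     .values("floor_ratio")[:constants.LOWER_RANGE]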

However ... I run into numerous problems when trying to implement the Subquery functionality: `AttributeError: 'ResolvedOuterRef' object has no attribute '_output_field_or_none'`, which I think has something to do with the annotations, but I can't find much on it.

The relevant restructured code is:

# This is the attempt that raises the ResolvedOuterRef error:
rl = Location.objects.filter(id__in=r).annotate(distance=D("pc__point", OuterRef('pc__point'))) \
                                      .order_by("distance")[:constants.LOWER_RANGE] \
                                      .annotate(floor_ratio=F("rental_valuation") /
                                                            F("floor_area")) \
                                      .distinct("floor_ratio")

and:

l.update(floor_area= F("rental_valuation") / CustomAVG(Subquery(locs),0))

I can see that this approach should be tremendously efficient, but getting it right seems somewhat far beyond my skill level.
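
From what I can gather, OuterRef only resolves once the queryset containing it is embedded, via Subquery, in an outer query; evaluating such a queryset on its own always fails. A minimal sketch of the shape I believe is required (untested; field names as in my models, distance ordering omitted for brevity):

from django.db.models import F, OuterRef, Subquery

# Untested sketch: this queryset is never evaluated on its own; it only
# compiles as part of the outer .annotate() below
ratios = Location.objects.filter(
    use_category__grouped_by=OuterRef("use_category__grouped_by"),
    floor_area__isnull=False,
).exclude(floor_area=0.0) \
 .annotate(floor_ratio=F("rental_valuation") / F("floor_area")) \
 .values("floor_ratio")

targets = Location.objects.filter(floor_area__isnull=True).annotate(
    sample_ratio=Subquery(ratios[:1])  # a Subquery must yield a single column
)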

– Turukawa

1 Answer

You can simplify your method using (mostly) the built-in query methods of Django, which are optimized. More specifically, we will use:

  • Subquery() expressions with OuterRef()
  • the dwithin distance lookup
  • F() expressions and annotate()

We will create a custom Aggregate class to apply our AVG function (method inspired by this excellent answer: Django 1.11 Annotating a Subquery Aggregate)

from django.db import models
from django.db.models import Subquery

class CustomAVG(Subquery):
    template = "(SELECT AVG(area_value) FROM (%(subquery)s) _avg)"  # "_avg": PostgreSQL needs a FROM alias
    output_field = models.FloatField()

and we will use it to calculate the following average:

from django.contrib.gis.measure import D
from django.db.models import F, OuterRef

for location in Location.objects.filter(rental_valuation__isnull=True):
    # Instances have no .update(); update through a pk-filtered queryset.
    # CustomAVG is itself a Subquery, so it takes the queryset directly.
    Location.objects.filter(pk=location.pk).update(
        rental_valuation=CustomAVG(
            Location.objects.filter(
                pc__point__dwithin=(OuterRef('pc__point'), D(m=1000)),
                rental_valuation__isnull=False
            ).annotate(area_value=F('rental_valuation') / F('floor_area'))
            .distinct('area_value')
            .values('area_value')  # expose only the column AVG() reads
        )
    )

Breakdown of the above:

  • We collect all the Location objects without a rental_valuation and loop through them.
  • Subquery: We select the Location objects that are within a circle of radius=1000m (change that as you like) from our current location point and we annotate on them the cost/m2 calculation (using F() to take the value of columns rental_valuation and floor_area of each object), as a column named area_value. For more accurate results, we select only the distinct values of this column.
  • We apply our CustomAVG to the subquery and we update our current location's rental_valuation.
– John Moutafis

  • Thanks for this. I've also tried splitting out the individual loops into Celery processes to see if that has any effect. I'll try to publish some updated times once I've incorporated your suggestions... – Turukawa Apr 08 '18 at 13:59
  • One concern is that there is a need to ensure some sort of statistical integrity by limiting the calculated `rental_valuation` to only where the count of locations is > `constants.LOWER_RANGE`. i.e. it would be risky to use only 1 or 2 other locations to infer 'area_value' and I want to use the smallest area possible to get to the count of properties able to produce that average. Any thoughts? – Turukawa Apr 08 '18 at 14:06
  • @Turukawa For the first part, you can use conditional expressions: https://docs.djangoproject.com/en/2.0/ref/models/conditional-expressions/ (one possible count gate is sketched after these comments). For the second part, you need to define the smallest range possible, according to your dataset. – John Moutafis Apr 10 '18 at 07:13
  • @Turukawa did you find this answer helpful? – John Moutafis Apr 12 '18 at 07:32
  • To be honest, I'm still figuring out how to implement it ... I've achieved a 35% time improvement so far (6 days to 4 days), but without Subquery. Problems I have are sorting on an annotation after applying `distinct` (which seems to require a second query), and then figuring out how to get a median instead of `AVG` since I need to discard outliers (not solved yet)... Still working on this and will come back to you in a few days. – Turukawa Apr 12 '18 at 15:10
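
Following up on the conditional-expressions pointer above, one possible shape for the LOWER_RANGE gate is sketched below (untested; it reuses CustomAVG's wrapping trick and assumes the dwithin/OuterRef combination resolves cleanly, which is exactly what is still unresolved above):

from django.contrib.gis.measure import D
from django.db import models
from django.db.models import F, OuterRef, Subquery

class CustomCOUNT(Subquery):
    # Same wrapping trick as CustomAVG; "_count" aliases the subquery
    # in FROM, as PostgreSQL requires
    template = "(SELECT COUNT(*) FROM (%(subquery)s) _count)"
    output_field = models.IntegerField()

neighbours = Location.objects.filter(
    pc__point__dwithin=(OuterRef("pc__point"), D(m=1000)),
    rental_valuation__isnull=False,
    floor_area__isnull=False,
).annotate(area_value=F("rental_valuation") / F("floor_area")) \
 .values("area_value")

# Only trust an average built from at least LOWER_RANGE comparables;
# everything else is left untouched for a wider-radius pass
qualified = Location.objects.filter(floor_area__isnull=True) \
    .annotate(n_comparables=CustomCOUNT(neighbours),
              avg_ratio=CustomAVG(neighbours)) \
    .filter(n_comparables__gte=constants.LOWER_RANGE)

for loc in qualified.iterator():
    loc.floor_area = loc.rental_valuation / loc.avg_ratio
    loc.save()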