Python creating density map efficiently

Question

I was hoping for a bit of help to make my code run faster.

Basically I have a square grid of lat,long points in a list insideoceanlist. Then there is a directory containing data files of lat, long coords which represent lightning strikes for a particular day. The idea is for each day, we want to know how many lightning strikes there were around each point on the square grid. At the moment it is just two for loops, so for every point on the square grid, you check how far away every lightning strike was for that day. If it was within 40km I add one to that point to make a density map.

The starting grid has the overall shape of a rectangle, made up of squares with width of 0.11 and length 0.11. The entire rectange is about 50x30. Lastly I have a shapefile which outlines the 'forecast zones' in Australia, and if any point in the grid is outside this zone then we omit it. So all the leftover points (insideoceanlist) are the ones in Australia.

There are around 100000 points on the square grid and even for a slow day there are around 1000 lightning strikes, so it takes a long time to process. Is there a way to do this more efficiently? I really appreciate any advice.

By the way I changed list2 into list3 because I heard that iterating over lists is faster than arrays in python.

for i in range(len(list1)): #list1 is a list of data files containing lat,long coords for lightning strikes for each day
    dict_density = {}
    for k in insideoceanlist: #insideoceanlist is a grid of ~100000 lat,long points
        dict_density[k] = 0
    list2 = np.loadtxt(list1[i],delimiter = ",") #this open one of the files containing lat,long coords and puts it into an array
    list3 = map(list,list2) #converts the array into a list
    # the following part is what I wanted to improve
    for j in insideoceanlist:
        for l in list3:
            if great_circle(l,j).meters < 40000: #great_circle is a function which measures distance between points the two lat,long points
                dict_density[j] += 1
    #
    filename = 'example' +str(i) + '.txt'
        with open(filename, 'w') as f:
            for m in range(len(insideoceanlist)):
                f.write('%s\n' % (dict_density[insideoceanlist[m]])) #writes each point in the same order as the insideoceanlist
    f.close()

A square grid of points can be described by a mathematical function, rather than being stored in a list. Likewise, which square a point falls into (equivalent to which square's center point it is closest to) can be computed directly, without doing a ton of comparisons. — Dan Getz, Apr 20 '16 at 02:04
By what mathematical function? I'm not really sure how to check which square a point falls into without iterating over every square. Every square is about 10km wide, so if I got this would I then just check each point for the squares nearby? Cheers — tpup1, Apr 20 '16 at 02:54
How exactly is your grid defined? I was assuming "square grid" meant perfectly square, which makes this just a question of scaling numbers and rounding, but if not that makes things a little more complicated. — Dan Getz, Apr 20 '16 at 03:12
Sorry I should have been clearer. The starting grid has the overall shape of a rectangle, made up of squares with width of 0.11 and length 0.11. The entire rectange is about 50x30. Lastly I have a shapefile which outlines the 'forecast zones' in Australia, and if any point in the grid is outside this zone then we omit it. So all the leftover points (insideoceanlist) are the ones in Australia. — tpup1, Apr 20 '16 at 03:23
Write a functions that will convert lightning strike locations into the grid square that contains them. This probably means rounding/truncating the lat/long value to whatever grid you are using. Now estimate how many 0.11 degree(?) increments (in the SMALLEST case) are in 40,000 meters. For each strike, go {up, down, left, right} that many increments, compute the distance, and increment if true. — aghast, Apr 20 '16 at 03:31
That converts O(100M) into O(1000*x²), where x is a very small number (how many 0.11 degree increments in 40km). — aghast, Apr 20 '16 at 03:32
It turns out that a degree varies in size as you go north/south. But assuming your box is around Australia, you're probably in the range -10..-40 degrees. Per [wikipedia](https://en.wikipedia.org/wiki/Longitude#Length_of_a_degree_of_longitude) a degree is about 80km at 45 degrees. So a 40km range would be half a degree in the worst case, or 5 of your 0.11 degree squares. So you can find the closest grid square, then search 121 squares (5 above, the center, 5 below) to see if they are within your 40 km circle. That means for 1k strikes you have to do 121k operations, less if you're clever. — aghast, Apr 20 '16 at 05:35
Possible duplicate of [Generate a heatmap in MatPlotLib using a scatter data set](http://stackoverflow.com/questions/2369492/generate-a-heatmap-in-matplotlib-using-a-scatter-data-set) — Ian Turton, Apr 20 '16 at 16:00

score 3 · Accepted Answer · answered Apr 20 '16 at 06:39

To elaborate a bit on @DanGetz's answer, here is some code that uses the strike data as the driver, rather than iterating the entire grid for each strike point. I'm assuming you're centered on Australia's median point, with 0.11 degree grid squares, even though the size of a degree varies by latitude!

Some back-of-the-envelope computation with a quick reference to Wikipedia tells me that your 40km distance is a ±4 grid-square range from north to south, and a ±5 grid-square range from east to west. (It drops to 4 squares in the lower latitudes, but ... meh!)

The tricks here, as mentioned, are to convert from strike position (lat/lon) to grid square in a direct, formulaic manner. Figure out the position of one corner of the grid, subtract that position from the strike, then divide by the size of the grid - 0.11 degrees, truncate, and you have your row/col indexes. Now visit all the surrounding squares until the distance grows too great, which is at most 1 + (2 * 2 * 4 * 5) = 81 squares checking for distance. Increment the squares within range.

The result is that I'm doing at most 81 visits times 1000 strikes (or however many you have) as opposed to visiting 100,000 grid squares times 1000 strikes. This is a significant performance gain.

Note that you don't describe your incoming data format, so I just randomly generated numbers. You'll want to fix that. ;-)

#!python3

"""
Per WikiPedia (https://en.wikipedia.org/wiki/Centre_points_of_Australia)

Median point
============

The median point was calculated as the midpoint between the extremes of
latitude and longitude of the continent.

    24 degrees 15 minutes south latitude, 133 degrees 25 minutes east
    longitude (24°15′S 133°25′E); position on SG53-01 Henbury 1:250 000
    and 5549 James 1:100 000 scale maps.

"""
MEDIAN_LAT = -(24.00 + 15.00/60.00)
MEDIAN_LON = (133 + 25.00/60.00)

"""
From the OP:

The starting grid has the overall shape of a rectangle, made up of
squares with width of 0.11 and length 0.11. The entire rectange is about
50x30. Lastly I have a shapefile which outlines the 'forecast zones' in
Australia, and if any point in the grid is outside this zone then we
omit it. So all the leftover points (insideoceanlist) are the ones in
Australia.
"""

DELTA_LAT = 0.11
DELTA_LON = 0.11

GRID_WIDTH = 50.0 # degrees
GRID_HEIGHT = 30.0 # degrees

GRID_ROWS = int(GRID_HEIGHT / DELTA_LAT) + 1
GRID_COLS = int(GRID_WIDTH / DELTA_LON) + 1

LAT_SIGN = 1.0 if MEDIAN_LAT >= 0 else -1.0
LON_SIGN = 1.0 if MEDIAN_LON >= 0 else -1.0

GRID_LOW_LAT = MEDIAN_LAT - (LAT_SIGN * GRID_HEIGHT / 2.0)
GRID_HIGH_LAT = MEDIAN_LAT + (LAT_SIGN * GRID_HEIGHT / 2.0)
GRID_MIN_LAT = min(GRID_LOW_LAT, GRID_HIGH_LAT)
GRID_MAX_LAT = max(GRID_LOW_LAT, GRID_HIGH_LAT)

GRID_LOW_LON = MEDIAN_LON - (LON_SIGN * GRID_WIDTH / 2.0)
GRID_HIGH_LON = MEDIAN_LON + (LON_SIGN * GRID_WIDTH / 2.0)
GRID_MIN_LON = min(GRID_LOW_LON, GRID_HIGH_LON)
GRID_MAX_LON = max(GRID_LOW_LON, GRID_HIGH_LON)

GRID_PROXIMITY_KM = 40.0

"""https://en.wikipedia.org/wiki/Longitude#Length_of_a_degree_of_longitude"""
_Degree_sizes_km = (
    (0,  110.574, 111.320),
    (15, 110.649, 107.551),
    (30, 110.852, 96.486),
    (45, 111.132, 78.847),
    (60, 111.412, 55.800),
    (75, 111.618, 28.902),
    (90, 111.694, 0.000),
)

# For the Australia situation, +/- 15 degrees means that our worst
# case scenario is about 40 degrees south. At that point, a single
# degree of longitude is smallest, with a size about 80 km. That
# in turn means a 40 km distance window will span half a degree or so.
# Since grid squares a 0.11 degree across, we have to check +/- 5
# cols.

GRID_SEARCH_COLS = 5

# Latitude degrees are nice and constant-like at about 110km. That means
# a .11 degree grid square is 12km or so, making our search range +/- 4
# rows.

GRID_SEARCH_ROWS = 4

def make_grid(rows, cols):
    return [[0 for col in range(cols)] for row in range(rows)]

Grid = make_grid(GRID_ROWS, GRID_COLS)

def _col_to_lon(col):
    return GRID_LOW_LON + (LON_SIGN * DELTA_LON * col)

Col_to_lon = [_col_to_lon(c) for c in range(GRID_COLS)]

def _row_to_lat(row):
    return GRID_LOW_LAT + (LAT_SIGN * DELTA_LAT * row)

Row_to_lat = [_row_to_lat(r) for r in range(GRID_ROWS)]

def pos_to_grid(pos):
    lat, lon = pos

    if lat < GRID_MIN_LAT or lat >= GRID_MAX_LAT:
        print("Lat limits:", GRID_MIN_LAT, GRID_MAX_LAT)
        print("Position {} is outside grid.".format(pos))
        return None

    if lon < GRID_MIN_LON or lon >= GRID_MAX_LON:
        print("Lon limits:", GRID_MIN_LON, GRID_MAX_LON)
        print("Position {} is outside grid.".format(pos))
        return None

    row = int((lat - GRID_LOW_LAT) / DELTA_LAT)
    col = int((lon - GRID_LOW_LON) / DELTA_LON)

    return (row, col)


def visit_nearby_grid_points(pos, dist_km):
    row, col = pos_to_grid(pos)

    # +0, +0 is not symmetric - don't increment twice
    Grid[row][col] += 1

    for dr in range(1, GRID_SEARCH_ROWS):
        for dc in range(1, GRID_SEARCH_COLS):
            misses = 0
            gridpos = Row_to_lat[row+dr], Col_to_lon[col+dc]
            if great_circle(pos, gridpos).meters <= dist_km:
                Grid[row+dr][col+dc] += 1
            else:
                misses += 1
            gridpos = Row_to_lat[row+dr], Col_to_lon[col-dc]
            if great_circle(pos, gridpos).meters <= dist_km:
                Grid[row+dr][col-dc] += 1
            else:
                misses += 1
            gridpos = Row_to_lat[row-dr], Col_to_lon[col+dc]
            if great_circle(pos, gridpos).meters <= dist_km:
                Grid[row-dr][col+dc] += 1
            else:
                misses += 1
            gridpos = Row_to_lat[row-dr], Col_to_lon[col-dc]
            if great_circle(pos, gridpos).meters <= dist_km:
                Grid[row-dr][col-dc] += 1
            else:
                misses += 1
            if misses == 4:
                break

def get_pos_from_line(line):
    """
    FIXME: Don't know the format of your data, just random numbers
    """
    import random
    return (random.uniform(GRID_LOW_LAT, GRID_HIGH_LAT),
            random.uniform(GRID_LOW_LON, GRID_HIGH_LON))

with open("strikes.data", "r") as strikes:
    for line in strikes:
        pos = get_pos_from_line(line)
        visit_nearby_grid_points(pos, GRID_PROXIMITY_KM)

Dan Getz · Answer 2 · 2016-04-20T04:24:23.177

If you know the formula that generates the points on your grid, you can probably find the closest grid point to a given point quickly by reversing that formula.

Below is a motivating example, that isn't quite right for your purposes because the Earth is a sphere, not flat or cylindrical. If you can't easily reverse the grid point formula to find the closest grid point, then maybe you can do the following:

create a second grid (let's call it G2) that is a simple formula like below, with big enough boxes such that you can be confident that the closest grid point to any point in one box will either be in the same box, or in one of the 8 neighboring boxes.
create a dict which stores which original grid (G1) points are in which box of the G2 grid
take the point p you're trying to classify, and find the G2 box it would go into
compare p to all the G1 points in this G2 box, and all the immediate neighbors of that box
choose the G1 point of these that's closest to p

Motivating example with a perfect flat grid

If you had a perfect square grid on a flat surface, that isn't rotated, with sides of length d, then their points can be defined by a simple mathematical formula. Their latitude values will all be of the form

lat0 + d * i

for some integer value i, where lat0 is the lowest-numbered latitude, and their longitude values will be of the same form:

long0 + d * j

for some integer j. To find what the closest grid point is for a given (lat, long) pair, you can separately find its latitude and longitude. The closest latitude number on your grid will be where

i = round((lat - lat0) / d)

and likewise j = round((long - long0) / d) for the longitude.

So one way you can go forward is to plug that in to the formulas above, and get

grid_point = (lat0 + d * round((lat - lat0) / d),
              long0 + d * round((long - long0) / d)

and just increment the count in your dict at that grid point. This should make your code much, much faster than before, because instead of checking thousands of grid points for distance, you directly found the grid point with a couple calculations.

You can probably make this a little faster by using the i and j numbers as indexes into a multidimensional array, instead of using grid_point in a dict.

Degrees of longitude get closer together as latitude increases. — aghast, Apr 20 '16 at 03:48
@AustinHastings Oh darn it. I was afraid this was going to need a more complicated method. Something similar is still possible though. — Dan Getz, Apr 20 '16 at 03:48
I'm not used to working with longitudes so I guess I'll add a caveat and leave this up in hopes it helps, until someone gives a better answer. Why can't we live on a torus? — Dan Getz, Apr 20 '16 at 03:56
Given what we know, I think the worst case is +/- 5 grid squares. So the code can just start at the grid point, and work outward until the distance metric fails. — aghast, Apr 20 '16 at 05:36
This was a huge help, thank you. If you assume the lightning strikes at the top of the grid (in this case -7.25 degrees lat) you can check there are 21 points which are always within 40km. Then if you assume it's at the bottom, there are at most 44 more points which could be with 40km. I might link what I did a bit later. — tpup1, Apr 21 '16 at 13:02
@tpup1 looks like I misunderstood what you're trying to do in more ways than one, but I'm glad this was helpful for you anyway! — Dan Getz, Apr 21 '16 at 13:13

score 0 · Answer 3 · answered Apr 20 '16 at 03:07

Have you tried using Numpy for the indexing? You can use multi-dimensional arrays, and the indexing should be faster because Numpy arrays are essentially a Python wrapper around C arrays.

If you need further speed increases, take a look at Cython, a Python to optimized C converter. It is especially good for multi-dimensional indexing, and should be able to speed this type of code by about an order of magnitude. It'll add a single additional dependency to your code, but it's a quick install, and not too difficult to implement.

(Benchmarks), (Tutorial using Numpy with Cython)

Also as a quick aside, use

for listI in list1:
    ...
    list2 = np.loadtxt(listI, delimiter=',')
 # or if that doesn't work, at least use xrange() rather than range()

essentially you should only ever use range() when you explicity need the list generated by the range() function. In your case, it shouldn't do much because it is the outer-most loop.

Thanks Justin, I'll definitely have a look at Cython. I'm not sure what you mean about the indexing. I have tried using the array 'list2' for the iteration but it's a little slower. Good tip about the range(), cheers. — tpup1, Apr 20 '16 at 03:38

Python creating density map efficiently

3 Answers3

Motivating example with a perfect flat grid