While it doesn't explain your memory issues, your implementation is, to put it mildly, suboptimal. Not only are you not using numpy to its full capabilities, but the flow of your algorithm also does little to avoid repeated calculations. I think you are simply running your system out of resources, not because something is wrong in Python or numpy, but because you are creating way too many unnecessary lists of lists of lists...
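To give a rough sense of where the memory goes, compare the footprint of a single 480p frame stored as a numpy array versus as nested Python lists. These are back-of-the-envelope figures for a 64-bit CPython; exact sizes vary by version:

import sys
import numpy as np

rows, cols = 480, 854   # one 480p frame

# A float64 array: 8 bytes per element plus a constant-size header.
arr = np.zeros((rows, cols))
arr.nbytes                           # 3279360 bytes, about 3 MB

# The same values as a list of lists: an 8-byte pointer per slot, a
# ~24-byte boxed float per distinct value, plus a header per row list.
per_row = sys.getsizeof([0.0] * cols)    # pointers + list header
boxed = sys.getsizeof(0.0)               # one float object
rows * (per_row + cols * boxed)          # roughly 13 MB

And that is for a single structure; every intermediate list of lists you build multiplies that cost, while numpy operations reuse flat buffers.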
After taking a look at Wikipedia's entry on the Lucas-Kanade algorithm, I rewrote your main function as follows:
import numpy as np

def lucas_kanade_np(im1, im2, win=2):
    assert im1.shape == im2.shape
    # Spatial and temporal derivatives, via central differences on the
    # interior pixels; the one-pixel border stays zero.
    I_x = np.zeros(im1.shape)
    I_y = np.zeros(im1.shape)
    I_t = np.zeros(im1.shape)
    I_x[1:-1, 1:-1] = (im1[1:-1, 2:] - im1[1:-1, :-2]) / 2
    I_y[1:-1, 1:-1] = (im1[2:, 1:-1] - im1[:-2, 1:-1]) / 2
    I_t[1:-1, 1:-1] = im1[1:-1, 1:-1] - im2[1:-1, 1:-1]
    # The five products that appear in the normal equations.
    params = np.zeros(im1.shape + (5,))  # Ix2, Iy2, Ixy, Ixt, Iyt
    params[..., 0] = I_x * I_x  # I_x2
    params[..., 1] = I_y * I_y  # I_y2
    params[..., 2] = I_x * I_y  # I_xy
    params[..., 3] = I_x * I_t  # I_xt
    params[..., 4] = I_y * I_t  # I_yt
    del I_x, I_y, I_t
    # 2D prefix sums, so that any rectangular window sum becomes a
    # four-term combination of entries (inclusion-exclusion).
    cum_params = np.cumsum(np.cumsum(params, axis=0), axis=1)
    del params
    # Sums of the five products over every (2*win+1) x (2*win+1) window.
    win_params = (cum_params[2 * win + 1:, 2 * win + 1:] -
                  cum_params[2 * win + 1:, :-1 - 2 * win] -
                  cum_params[:-1 - 2 * win, 2 * win + 1:] +
                  cum_params[:-1 - 2 * win, :-1 - 2 * win])
    del cum_params
    # Solve the 2x2 system at every pixel with Cramer's rule, leaving
    # zero flow wherever the system is singular.
    op_flow = np.zeros(im1.shape + (2,))
    det = win_params[..., 0] * win_params[..., 1] - win_params[..., 2] ** 2
    op_flow_x = np.where(det != 0,
                         (win_params[..., 1] * win_params[..., 3] -
                          win_params[..., 2] * win_params[..., 4]) / det,
                         0)
    op_flow_y = np.where(det != 0,
                         (win_params[..., 0] * win_params[..., 4] -
                          win_params[..., 2] * win_params[..., 3]) / det,
                         0)
    # Write each window's result at its center pixel; borders stay zero.
    op_flow[win + 1: -1 - win, win + 1: -1 - win, 0] = op_flow_x[:-1, :-1]
    op_flow[win + 1: -1 - win, win + 1: -1 - win, 1] = op_flow_y[:-1, :-1]
    return op_flow
It uses two nested calls to np.cumsum and the inclusion-exclusion principle to compute the windowed parameters. Since the system of equations to be solved at each point is only 2x2, it uses Cramer's rule to vectorize the solving across the whole image.
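In case the windowed-sum trick looks opaque, here is a minimal standalone sketch (the array size is arbitrary) that checks the four-term inclusion-exclusion combination against a brute-force loop over windows:

import numpy as np

win = 2
a = np.random.rand(10, 12)
# Double prefix sum: c[i, j] holds the sum of a[:i+1, :j+1].
c = np.cumsum(np.cumsum(a, axis=0), axis=1)
# Four-term combination: the sum over each (2*win+1) x (2*win+1)
# window whose top-left corner is at (i+1, j+1).
fast = (c[2 * win + 1:, 2 * win + 1:] - c[2 * win + 1:, :-1 - 2 * win] -
        c[:-1 - 2 * win, 2 * win + 1:] + c[:-1 - 2 * win, :-1 - 2 * win])
# Brute force over the same windows (windows touching row/column 0 are
# skipped, matching the border handling of the function above).
slow = np.array([[a[i + 1:i + 2 * win + 2, j + 1:j + 2 * win + 2].sum()
                  for j in range(a.shape[1] - 2 * win - 1)]
                 for i in range(a.shape[0] - 2 * win - 1)])
np.testing.assert_almost_equal(fast, slow)

The point of the trick is cost: the cumsum version does a constant amount of work per pixel, regardless of the window size.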
For comparison, I renamed your lucas_kanade function to lucas_kanade_op, with a single change to its last statement so that it returns a numpy array:

def lucas_kanade_op(im1, im2, win=2):
    ...
    return np.array(opfl)
I timed both approaches (and checked that they produce the same output), and, no surprise, taking advantage of numpy provides a tremendous boost:
rows, cols = 100, 100
im1 = np.random.rand(rows, cols)
im2 = np.random.rand(rows, cols)

ans1 = lucas_kanade_op(im1, im2)
ans2 = lucas_kanade_np(im1, im2)
np.testing.assert_almost_equal(ans1, ans2)

import timeit
print 'op\'s time:', timeit.timeit('lucas_kanade_op(im1, im2)',
                                   'from __main__ import lucas_kanade_op, im1, im2',
                                   number=1)
print 'np\'s time:', timeit.timeit('lucas_kanade_np(im1, im2)',
                                   'from __main__ import lucas_kanade_np, im1, im2',
                                   number=1)
This prints out:
op's time: 5.7419579567
np's time: 0.00256002154425
So that's roughly a 2000x speed-up for a smallish 100x100 image. I did not dare test your approach on a full-size 480p image, but the function above can handle around 5 computations per second on a random 854x480 array, with no issues whatsoever.
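If you want to reproduce that figure, something along these lines should do (exact timings will of course depend on your hardware):

im1 = np.random.rand(480, 854)
im2 = np.random.rand(480, 854)
print timeit.timeit('lucas_kanade_np(im1, im2)',
                    'from __main__ import lucas_kanade_np, im1, im2',
                    number=5)
# ~1 second total would match the ~5 computations per second quoted above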
I'd recommend you rewrite your code in a way similar to what is proposed above, taking advantage of numpy to its fullest. Posting your full code to Code Review would be a good starting point. But there really is no point in looking for stray references to objects when your code is so inefficient to begin with!