
The working for loop, expected result

I am attempting to vectorize a slow for loop that runs over a very large dataset to remove duplicates based on a test. The result should keep only rows whose first 3 elements are unique; among duplicates, the row with the largest 4th element wins. e.g.

in_arr = np.array(((0, 12, 13, 1), (0, 12, 13, 10), (1, 12, 13, 2)))

should become

out = np.array(((0, 12, 13, 10), (1, 12, 13, 2)))

This is trivial to achieve with a for loop, but as I mentioned, it is very slow.

unique = np.unique(in_arr[:, :3], axis=0)
out = np.empty((0, 4))
for i in unique:
    # rows whose first 3 columns match this key; keep the largest 4th column
    row_max = np.max(in_arr[np.all(in_arr[:, :3] == i, axis=1)][:, 3])
    out = np.vstack((out, np.hstack((i, row_max))))

What I've tried (1)

When I attempt to remove the for loop by indexing, replacing each i[:] with unique[np.arange(unique.shape[0])]:

out = np.vstack((out, np.hstack((unique[np.arange(unique.shape[0])], np.max(in_arr[np.all(in_arr[:, :3].astype(int) == unique[np.arange(unique.shape[0])], axis=1)][:, 3])))))

NumPy complains about the input shape in conjunction with np.all:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "<__array_function__ internals>", line 6, in all
  File "/usr/local/lib/python3.6/dist-packages/numpy/core/fromnumeric.py", line 2351, in all
    return _wrapreduction(a, np.logical_and, 'all', axis, None, out, keepdims=keepdims)
  File "/usr/local/lib/python3.6/dist-packages/numpy/core/fromnumeric.py", line 90, in _wrapreduction
    return ufunc.reduce(obj, axis, dtype, out, **passkwargs)
numpy.AxisError: axis 1 is out of bounds for array of dimension 0

What I've tried (2)

Based on a suggestion from StackOverflow while typing this question (Broadcasting/Vectorizing inner and outer for loops in python/NumPy):

newout = np.vstack((newout, np.hstack((tempunique[:, None], np.max(inout[np.all(inout[:, :3].astype(int) == tempunique[:, None], axis=1)][:, 3])))))

I get an error complaining about a size mismatch between the boolean index and the indexed array:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
IndexError: boolean index did not match indexed array along dimension 0; dimension is 3 but corresponding boolean dimension is 2

Restating the question

Is there a correct way to broadcast my indices to eliminate the for loop?
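For reference, the shape the broadcast needs is a new axis on the unique array, so the (U, 3) keys compare against all (N, 3) input rows at once. A minimal sketch of that idea (variable names mirror the loop above; this assumes the data stays integer-valued, as in the example):

```python
import numpy as np

in_arr = np.array(((0, 12, 13, 1), (0, 12, 13, 10), (1, 12, 13, 2)))

unique = np.unique(in_arr[:, :3], axis=0)          # shape (U, 3)

# (U, 1, 3) == (1, N, 3) broadcasts to (U, N, 3); reducing over the last
# axis leaves a (U, N) mask of which input rows match each unique key.
mask = np.all(unique[:, None, :] == in_arr[None, :, :3], axis=2)

# Replace non-matching rows with the global minimum, then take row-wise max.
vals = np.where(mask, in_arr[:, 3], in_arr[:, 3].min()).max(axis=1)

out = np.hstack((unique, vals[:, None]))
print(out)
# [[ 0 12 13 10]
#  [ 1 12 13  2]]
```

Note the (U, N) intermediate mask means memory grows with rows × unique keys, which may matter at this dataset's size.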

WesH
    Would it only have positive numbers? Does it have a range? – Divakar Jun 04 '20 at 20:08
  • Values are always positive. This is a list of contour radii on an image, so the max values are ((index number, max=1920, max=1080, max=~500)) – WesH Jun 04 '20 at 20:11

1 Answer


I don't know enough about your use case to determine whether it's worth introducing Pandas, but doing this efficiently in Pandas takes just a few lines of code:

import numpy as np
import pandas as pd

in_array = np.array(((0, 12, 13, 1), (0, 12, 13, 10), (1, 12, 13, 2)))
in_df = pd.DataFrame(in_array)


# group by unique combinations of the 0th, 1st, and 2nd columns, then take the
# max of the 3rd column in each group; `reset_index` moves columns 0-2 from the
# index back to ordinary columns
out_df = in_df.groupby([0, 1, 2])[3].max().reset_index()
out_array = out_df.values

print(out_array)
# Output:
# [[ 0 12 13 10]
#  [ 1 12 13  2]]

A simple timing test shows that processing a randomly generated 100,000-row input array takes 0.0117 secs with Pandas versus 2.6103 secs with your for loop implementation.
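A timing harness along these lines reproduces the comparison (the key range, array contents, and machine will shift the exact numbers; the small key range here just keeps the loop's iteration count manageable):

```python
import time
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# small positive integers in the first three columns so duplicate keys occur
arr = rng.integers(0, 10, size=(100_000, 4))

start = time.perf_counter()
out = pd.DataFrame(arr).groupby([0, 1, 2])[3].max().reset_index().values
print(f"pandas:   {time.perf_counter() - start:.4f} s")

start = time.perf_counter()
unique = np.unique(arr[:, :3], axis=0)
loop_out = np.empty((0, 4))
for key in unique:
    row_max = np.max(arr[np.all(arr[:, :3] == key, axis=1)][:, 3])
    loop_out = np.vstack((loop_out, np.hstack((key, row_max))))
print(f"for loop: {time.perf_counter() - start:.4f} s")
```

Both branches sort their output by key (groupby and np.unique both sort lexicographically), so the two result arrays can be compared directly.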

xcmkz
  • I'm not sure that ``pandas`` is going to be much faster than a for loop when my input data is 8mil lines. However, the ``groupby`` did lead me to https://stackoverflow.com/questions/38013778/is-there-any-numpy-group-by-function. I'm going to see if I can sort something out with this lead. – WesH Jun 04 '20 at 19:52
  • With the ``numpy`` version of ``groupby``, I still needed to add a for loop. The ``pandas`` code is ~20 times faster when I timed it. Looks like I'm adding ``pandas``! – WesH Jun 04 '20 at 20:05
  • Pandas is fairly well optimized actually. I did a simple scaling test on my laptop and on a random input array of shape (8mil, 4) it only took 0.52 seconds (although I guess the speedup also depends on the content of the array). – xcmkz Jun 04 '20 at 20:08
  • Yeah, I was impressed. The calc is a bit lower with floats in the real data, but the speed is worthy. – WesH Jun 04 '20 at 20:09
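The sort-based NumPy "group by" the comments point to can be sketched as follows: lexsort so rows sharing a key are contiguous with the largest 4th column last in each group, then keep each group's last row. This is a sketch, not the asker's final code:

```python
import numpy as np

in_arr = np.array(((0, 12, 13, 1), (0, 12, 13, 10), (1, 12, 13, 2)))

# lexsort treats its last key as primary, so this sorts by columns 0, 1, 2
# and breaks ties by column 3 ascending (largest value ends up last).
order = np.lexsort((in_arr[:, 3], in_arr[:, 2], in_arr[:, 1], in_arr[:, 0]))
s = in_arr[order]

# a group ends where the next row's key differs; keep the last row per group
last = np.ones(len(s), dtype=bool)
last[:-1] = np.any(s[1:, :3] != s[:-1, :3], axis=1)
out = s[last]
print(out)
# [[ 0 12 13 10]
#  [ 1 12 13  2]]
```

Unlike the broadcasting approach, this stays O(N log N) in time and O(N) in memory regardless of how many unique keys there are.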