I'm going to be doing some geometric calculations involving 2-D and 3D points using numpy.
What is the canonical representation of a 2-D or 3-D point? Please assume minimal familiarity with numpy, data shapes, etc.
The representation of a single point in Cartesian space is somewhat trivial. You could even use flat tuples or lists to represent them and matrix operations would still work, but if you want to add or scale points (which is fundamentally what linear spaces are for) you have to use arrays. I don't see a reason not to use a 1d array of shape (d,) in d dimensions: you can use it both as a column and as a row vector on either side of a matrix with the @ matmul operator:
import numpy as np
rot90 = np.array([[0, -1, 0], [1, 0, 0], [0, 0, 1]])  # rotate 90 degrees around z
inp = np.array([1, 0, 0])  # the x unit vector
# rotate (matrix times column vector):
inp_rot = rot90 @ inp  # the y unit vector
# inverse transform (a row vector times the matrix applies the transpose,
# which for a rotation matrix is its inverse):
inp_invrot = inp @ rot90  # the -y unit vector
A much better question is how to represent collections of points in Cartesian space. If you have N points you will probably want to use a 2d array. But which shape should it be, (N, d) or (d, N)? The answer depends on your use case, but without further input you'll want to choose (N, d).
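To make the (N, d) layout concrete for readers new to numpy, here is a minimal sketch with three made-up 3-d points; each row is one point:
import numpy as np

points = np.array([[1.0, 0.0, 0.0],   # point 0
                   [0.0, 2.0, 0.0],   # point 1
                   [0.0, 0.0, 3.0]])  # point 2

print(points.shape)  # (3, 3): N = 3 points, d = 3 dimensions
print(points[1])     # row 1 is the second point: [0. 2. 0.]
print(points[:, 2])  # column 2 holds the z component of every point: [0. 0. 3.]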
Arrays in numpy are "C-contiguous" by default, which is also called row-major memory layout. This means that on creation an array occupies a contiguous block of memory, and the items are laid out in memory row after row. In the following example the values coincide with their positions in memory:
>>> np.arange(2*3).reshape(2, 3)
array([[0, 1, 2],
       [3, 4, 5]])
One of the reasons we use numpy is that a contiguous block of memory for a given type occupies much less space than a native Python container of the same size, at least for large datasets. The other reason is that we can use vectorized operations that work on slices of the input "simultaneously". The quotes are there because the CPU still has to churn through the data, but it turns out that you can achieve quite a speedup by making good use of CPU caches. And this is where memory layout comes into play: operations that access elements which are close in memory have a higher chance of hitting the cache, and the reduced traffic between RAM and CPU leads to shorter runtimes.
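If you want to see the layout for yourself, numpy exposes it through the flags and strides attributes. This is just a small inspection of a toy array, no timing claims attached:
import numpy as np

a = np.arange(2 * 3).reshape(2, 3)
print(a.flags['C_CONTIGUOUS'])    # True: the rows are contiguous in memory
print(a.ravel(order='K'))         # [0 1 2 3 4 5]: the actual memory order

# The transpose is a view on the same memory with swapped strides,
# so its rows are no longer contiguous:
print(a.T.flags['C_CONTIGUOUS'])  # False
print(a.strides, a.T.strides)     # e.g. (24, 8) vs (8, 24) for 8-byte integers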
The problem is not trivial, because vectorizing along larger non-contiguous dimensions might end up faster than vectorizing along smaller contiguous ones. However, without any additional information it's a good rule of thumb to put the dimensions along which you are likely to perform vectorized operations and reductions, such as .mean() or .sum(), last. In the case of N points in d-dimensional space it's quite likely that you will want to handle each point separately: matrix multiplications and things like scalar products and vector norms all want to work with one component after the other for a given point.
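For instance, with points stored as an (N, d) array, per-point reductions run along the last axis while whole-cloud reductions run along the first; the values below are made up purely for illustration:
import numpy as np

points = np.random.default_rng(0).normal(size=(5, 3))  # 5 points in 3-d space

# squared norm of every point: reduce over the last (contiguous) axis
sq_norms = (points ** 2).sum(axis=-1)    # shape (5,)
norms = np.linalg.norm(points, axis=-1)  # same thing via linalg

# centroid of the whole point cloud: reduce over the batch axis instead
centroid = points.mean(axis=0)           # shape (3,)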
This is why you will see numpy and scipy functions usually assume arrays of shape (N, d): the inner dimension is second and the "batch" index is first. Consider for example numpy.linalg.eig:
Parameters:
    a : (…, M, M) array
        Matrices for which the eigenvalues and right eigenvectors will be computed

Returns:
    w : (…, M) array
        The eigenvalues, each repeated according to its multiplicity. The eigenvalues
        are not necessarily ordered. The resulting array will be of complex type,
        unless the imaginary part is zero in which case it will be cast to a real
        type. When a is real the resulting eigenvalues will be real (0 imaginary
        part) or occur in conjugate pairs
[...]
It treats multidimensional arrays as batches of matrices, where the last two indices correspond to the Cartesian indices. Similarly the returned eigenvalues and eigenvectors have batch indices first and vector space indices last.
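To make the batching concrete, here is a small sketch that passes a stack of two 2x2 matrices to numpy.linalg.eig; the batch axis comes first and the matrix axes come last:
import numpy as np

# a stack of two 2x2 matrices, shape (2, 2, 2): the batch index comes first
mats = np.array([[[2.0, 0.0],
                  [0.0, 3.0]],
                 [[0.0, 1.0],
                  [1.0, 0.0]]])

w, v = np.linalg.eig(mats)
print(w.shape)  # (2, 2): one set of eigenvalues per matrix in the batch
print(v.shape)  # (2, 2, 2): one eigenvector matrix per matrix in the batch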
A more direct example is scipy.spatial.distance.pdist, which computes the distance between pairs of points in a collection:
Parameters
    X : ndarray
        An m by n array of m original observations in an n-dimensional space.
[...]
Again you can see the convention that Cartesian indices are last. The same goes for scipy.interpolate.griddata and probably a bunch of other functions.
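As a quick sketch with made-up points, pdist takes exactly this (N, d) layout and returns the pairwise distances in condensed form:
import numpy as np
from scipy.spatial.distance import pdist, squareform

points = np.array([[0.0, 0.0, 0.0],
                   [1.0, 0.0, 0.0],
                   [0.0, 1.0, 0.0]])  # three points, shape (3, 3)

dists = pdist(points)     # condensed vector of the three pairwise distances
print(dists)              # [1.         1.         1.41421356]
print(squareform(dists))  # the full, symmetric 3x3 distance matrix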
So if you have a good reason to use either representation: do that. But if you don't have a good indicator (such as the results of profiling both representations) you should stick with the "batch of vectors/matrices" approach usually employed by numpy and scipy (shape (N, d)), because you might even end up using some of these functions, for which your representation will then be native.
Represent them in your source code as tuples or lists, e.g. (1, 0) or [1, 0, 1].
As per this example from scipy:
>>> from scipy.spatial import distance
>>> distance.euclidean([1, 0, 0], [0, 1, 0])
1.4142135623730951