
Here's the example I'm working on:

processed_data = np.empty_like(data)
min_per_col = np.amin(data, axis=0) # axis0 for col, axis1 for row
max_per_col = np.amax(data, axis=0) # axis0 for col, axis1 for row
for row_idx, row in enumerate(data):
    for col_idx, val in enumerate(row):
        processed_data[row_idx][col_idx] = (val - min_per_col[col_idx]) / (max_per_col[col_idx] - min_per_col[col_idx])

data is defined as a 2d numpy array. I am essentially trying to perform some operation on each element in data using the relevant values in min_per_col and max_per_col.

I can't figure out which approach to take. From these posts, it seems the answer is to reshape the arrays so that broadcasting works.

Intuitively, I think the way it would work with broadcasting would be:

# Results of min_per_col: 
#     [min1 min2 min3 min4 min5]

# Transformation to (call this 2d_min_per_col):
#     [[min1 min2 min3 min4 min5],
#      [min1 min2 min3 min4 min5],
#      [min1 min2 min3 min4 min5]
#      ...
#      [min1 min2 min3 min4 min5]]
# which basically duplicates min_per_col into a 2d array form.

# Do the same for max (2d_max_per_col)

# processed_data = (data - 2d_min_per_col) / (2d_max_per_col - 2d_min_per_col)

Does this approach make sense? Or is there another answer for how to approach something like this?

Please let me know if there's anything else that would be helpful to include for this post! Thank you.

EDIT: Thanks for the help Mad Physicist! After trying this:

processed_data = np.empty_like(data)
min_per_col = np.amin(data, axis=0) # axis0 for col, axis1 for row
max_per_col = np.amax(data, axis=0) # axis0 for col, axis1 for row
for row_idx, row in enumerate(data):
    for col_idx, val in enumerate(row):
        processed_data[row_idx, col_idx] = (val - min_per_col[col_idx]) / (max_per_col[col_idx] - min_per_col[col_idx])
print("version 1\n", processed_data)

processed_data = (data - min_per_col) / (max_per_col - min_per_col)
print("version 2\n", processed_data)

return processed_data

It works identically, and is much faster!

version 1
 [[0.25333333 0.13793103 0.14285714]
 [0.32       0.79310345 0.92857143]
 [0.13333333 0.48275862 0.51785714]
 ...
 [0.28       0.4137931  0.125     ]
 [0.01333333 0.24137931 0.75      ]
 [0.08       0.20689655 0.23214286]]
version 2
 [[0.25333333 0.13793103 0.14285714]
 [0.32       0.79310345 0.92857143]
 [0.13333333 0.48275862 0.51785714]
 ...
 [0.28       0.4137931  0.125     ]
 [0.01333333 0.24137931 0.75      ]
 [0.08       0.20689655 0.23214286]]

Thanks for the fast help :D

  • Never index a numpy array as `[index1][index2]` unless you know what you're doing. Always use `[index1, index2]` – Mad Physicist Apr 08 '21 at 20:14
  • Have you tried the broadcasting approach? Did it work (as in same results as loop)? – Mad Physicist Apr 08 '21 at 20:16
  • Also, the whole point of broadcasting is that you only need the first row of `min/max_per_col`, not the whole expanded array. – Mad Physicist Apr 08 '21 at 20:17
  • Ah cool - from this numpy documentation https://numpy.org/devdocs/user/basics.indexing.html it seems the reason is that [index1][index2] creates a temporary array in memory, whereas [index1, index2] accesses the element directly with no in-between step, which makes [index1, index2] more efficient. Thanks for the tip! – Griffin Beels Apr 08 '21 at 20:19
  • It's a temporary array object, but if you are lucky, no memory is copied and you get a view, as in this case. But for more complicated indices, especially non-slice indices, you will get nothing but trouble. Good on you for looking it up. – Mad Physicist Apr 08 '21 at 20:20
  • re: only needing first row of `min/max_per_col` does this mean that numpy figures out the correct indexing into `min/max_per_col` for each row without us needing to manually expand? – Griffin Beels Apr 08 '21 at 20:21
  • That's exactly correct: broadcasting means that you effectively repeat that row as many times as you need. – Mad Physicist Apr 08 '21 at 20:57
  • Don't forget to select the answer to get your question off the unanswered queue – Mad Physicist Apr 08 '21 at 20:57
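
The indexing pitfall discussed in the comments above can be sketched as follows. This is a minimal illustration on small made-up arrays (the names `a`, `b`, and `rows` are hypothetical and not from the question); it shows one case where chained indexing selects something different from what was intended, and one where a chained assignment is lost because the intermediate result is a copy.

```python
import numpy as np

# Hypothetical 3x4 array, used only for illustration.
a = np.arange(12).reshape(3, 4)

# a[:] is the whole array again, so the second index still picks a ROW.
row_not_col = a[:][1]   # row 1
col = a[:, 1]           # column 1

# With a list of row indices, b[rows] is a copy, not a view.
b = np.arange(12).reshape(3, 4)
rows = [0, 2]
b[rows][0] = 99          # writes into the temporary copy; b is left untouched
unchanged = np.array_equal(b, np.arange(12).reshape(3, 4))

b[rows, 0] = 99          # a single indexing operation writes into b itself

print(row_not_col, col)
print(unchanged, b[:, 0])
```

With plain integer indices, as in the loop in the question, both spellings reach the same element, which is why the original code still produced correct results.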

1 Answer

You have the gist of it, but the whole point of broadcasting is that you don't need to expand arrays to operate on them: the shapes are aligned starting from the rightmost dimension. For example, if data.shape is (M, N), your array shapes look like this to the math operations:

data:           (M, N)
processed_data: (M, N)
min_per_col:       (N,)
max_per_col:       (N,)

Notice that min_per_col and max_per_col line up perfectly as they should. That means that your entire loop becomes simply

processed_data = (data - min_per_col) / (max_per_col - min_per_col)
#                    (M, N)                         (N,)
#                                   (M, N)

The comments under each operator show the shape of the broadcasted output.
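
As a quick sanity check, here is a small sketch that compares the explicit loop with the broadcast expression. The sample values are made up for illustration, and `np.broadcast_shapes` assumes NumPy 1.20 or newer. `np.broadcast_to` is used only to show what the "virtually repeated" row looks like; it returns a read-only view, so no memory is duplicated.

```python
import numpy as np

# Hypothetical sample data with M = 4 rows and N = 3 columns.
data = np.array([[1.0, 10.0, 100.0],
                 [2.0, 20.0, 300.0],
                 [4.0, 40.0, 200.0],
                 [3.0, 30.0, 400.0]])

min_per_col = data.min(axis=0)   # shape (N,)
max_per_col = data.max(axis=0)   # shape (N,)

# Element-by-element version, equivalent to the loop in the question.
looped = np.empty_like(data)
for i in range(data.shape[0]):
    for j in range(data.shape[1]):
        looped[i, j] = (data[i, j] - min_per_col[j]) / (max_per_col[j] - min_per_col[j])

# Broadcast version: (M, N) combined with (N,) gives (M, N).
broadcasted = (data - min_per_col) / (max_per_col - min_per_col)

# Shape that NumPy computes for combining (M, N) with (N,).
result_shape = np.broadcast_shapes(data.shape, min_per_col.shape)

# Read-only view of min_per_col repeated along the rows.
expanded_min = np.broadcast_to(min_per_col, data.shape)

print(result_shape)
print(np.allclose(looped, broadcasted))
```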

As an aside, you can compute the denominator in a single step using np.ptp:

processed_data = (data - np.min(data, axis=0)) / np.ptp(data, axis=0)
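
One caveat that the expression above does not handle: a column whose values are all equal has a range of zero, so the division produces nan (with a runtime warning). The sketch below wraps the `np.ptp` version in a hypothetical helper, `min_max_scale`, which is not part of NumPy. Mapping constant columns to 0 and converting the input to float are assumptions made for illustration, not something the original post specifies.

```python
import numpy as np

def min_max_scale(data):
    """Hypothetical helper: scale each column of a 2D array to [0, 1]."""
    # Assumption: work in float so integer input is not truncated.
    data = np.asarray(data, dtype=float)
    col_min = data.min(axis=0)          # shape (N,)
    col_range = np.ptp(data, axis=0)    # max - min per column, shape (N,)
    # Assumption: a constant column (range 0) is mapped to all zeros
    # by dividing by 1 instead of 0.
    safe_range = np.where(col_range == 0, 1.0, col_range)
    return (data - col_min) / safe_range

# Second column is constant, so its range is zero.
sample = np.array([[1.0, 5.0],
                   [3.0, 5.0],
                   [2.0, 5.0]])
scaled = min_max_scale(sample)
print(scaled)
```

Whether a constant column should become 0, 0.5, or nan depends on the use case, so treat the choice here as a placeholder.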
Mad Physicist
  • Worked perfectly! Thank you so much for the quick help, I definitely learned something new -- your explanations were very helpful :) – Griffin Beels Apr 08 '21 at 20:36