How to convert two associated arrays so that elements are evenly distributed?

Question

There are two arrays: an array of images and an array of the corresponding labels (e.g. pictures of figures and their values). The labels are unevenly distributed. [Image: distribution of the labels]

What I want is to trim both arrays so that the labels are evenly distributed, e.g. every label occurs 2 times.

To test the idea, I created two 1D arrays, and this worked:

import numpy as np

labels = np.array([1, 2, 3, 3, 1, 2, 1, 3, 1, 3, 1])
images = np.array(['A', 'B', 'C', 'C', 'A', 'B', 'A', 'C', 'A', 'C', 'A'])
x, y = zip(*sorted(zip(images, labels)))

label = list(set(y))
new_images = []
new_labels = []
amount = 2

for i in label:
    start = y.index(i)
    stop = start + amount
    new_images = np.append(new_images, x[start: stop])
    new_labels = np.append(new_labels, y[start: stop])

What I get/want is this:

new_labels:  [ 1.  1.  2.  2.  3.  3.]
new_images:  ['A' 'A' 'B' 'B' 'C' 'C']

(The arrays do not need to be sorted.)

But when I tried it with the real data (images.shape = (35000, 32, 32, 3), labels.shape = (35000,)), I got an error:
ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()

This does not help me a lot: ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()

I think that my solution is quite dirty anyhow. Is there a way to do it right?

Thank you very much in advance!

Do you have a stacktrace? Please post it. At the very least, which statement fails? — 9000, Jan 16 '18 at 18:45
The error occurs in this line: x, y = zip(*sorted(zip(images, labels))) But the answer from Hielke Walinga explains the problem very well. I had to replace the line with: x, y = zip(*sorted(zip(images, labels), key=itemgetter(1))) to sort by the labels and not by the images. — Rombrand, Jan 17 '18 at 14:06

Hielke Walinga · Answer 1 · 2018-01-17T16:37:24.950

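The sorted function compares the zipped tuples element by element, and without a key that comparison includes the images. In your real data each image is an array (instead of a single string as in the 1D test), and comparing two arrays gives an array of booleans rather than a single True or False, so Python cannot decide the order and raises this error.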

Let me explain it in a bit more detail:

x, y = zip(*sorted(zip(images, labels)))

First, you zip your images and labels. This means that you create tuples of the corresponding elements of images and labels: the first element of images with the first element of labels, and so on.

In case of your real data, each label is paired with an array with shape (32, 32, 3).

Second, you sort all those tuples. Tuples are compared element by element: the first elements are compared, and only when they are equal are the second elements compared. Here the first element of each tuple is an image array, and comparing two arrays does not give a single True or False, so the comparison throws the error.
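To make the failure concrete, here is a minimal sketch that reproduces it. The tiny 2x2 arrays are hypothetical stand-ins for the real (32, 32, 3) images, and the variable names are made up for illustration.

```python
import numpy as np

# Hypothetical stand-ins for two images (the real ones would be 32x32x3).
img_a = np.zeros((2, 2))
img_b = np.ones((2, 2))

# Same layout as zip(images, labels): image first, label second.
pair_a = (img_a, 1)
pair_b = (img_b, 2)

try:
    # Without a key, sorted() has to compare the arrays themselves.
    sorted([pair_b, pair_a])
    error_message = None
except ValueError as exc:
    error_message = str(exc)

print(error_message)

# With a key, only the labels (plain integers) are compared.
ordered = sorted([pair_b, pair_a], key=lambda pair: pair[1])
ordered_labels = [label for _, label in ordered]
print(ordered_labels)  # [1, 2]
```

The printed message should be the same "truth value of an array ... is ambiguous" error as in the question.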

You can solve this by explicitly telling the sorted function to sort only on the label, which is the second element of each tuple.

x, y = zip(*sorted(zip(images, labels), key=lambda pair: pair[1]))

Using itemgetter instead of the lambda can be slightly faster on large inputs.

from operator import itemgetter
x, y = zip(*sorted(zip(images, labels), key=itemgetter(1)))
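As an alternative to sorting, the selection can be done with index arrays in NumPy, which avoids comparing images altogether. This is only a sketch of a different technique, not the approach from the question: the helper name balance_by_label is hypothetical, and it assumes that keeping the first `amount` occurrences of each label is acceptable.

```python
import numpy as np


def balance_by_label(images, labels, amount):
    """Keep the first `amount` samples of every label (hypothetical helper)."""
    keep = []
    for label in np.unique(labels):
        # Positions at which this label occurs, in original order.
        positions = np.flatnonzero(labels == label)
        if len(positions) < amount:
            raise ValueError(f"label {label!r} occurs only {len(positions)} times")
        keep.append(positions[:amount])
    keep = np.concatenate(keep)
    # Indexing along the first axis works for 1D and (N, 32, 32, 3) arrays alike.
    return images[keep], labels[keep]


labels = np.array([1, 2, 3, 3, 1, 2, 1, 3, 1, 3, 1])
images = np.array(['A', 'B', 'C', 'C', 'A', 'B', 'A', 'C', 'A', 'C', 'A'])

new_images, new_labels = balance_by_label(images, labels, amount=2)
print(new_labels)  # [1 1 2 2 3 3]
print(new_images)  # ['A' 'A' 'B' 'B' 'C' 'C']
```

Because the labels keep their integer dtype, the result is not converted to floats as happens when appending to an empty list with np.append.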

Thank you very much! I get the point, which makes the problem a lot clearer. I've seen these code lines in the forum but without any explanation, so I could not understand them. But I think you meant to sort over the second element (labels), so the line looks like this: x, y = zip(*sorted(zip(images, labels), key=itemgetter(1))) — Rombrand, Jan 17 '18 at 14:16
Yes, but don't forget to import itemgetter in that case. Note that itemgetter is basically the same as the lambda function, but since it is implemented in C rather than in Python it is slightly faster. This only matters for very large data, though, since you also have the import overhead. — Hielke Walinga, Jan 17 '18 at 16:39
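The speed difference mentioned above can be checked with a rough timing sketch like the one below. The data is made up (string placeholders instead of image arrays), and the measured numbers will vary by machine and Python version, so treat the output as indicative only.

```python
import random
import timeit
from operator import itemgetter

random.seed(0)
# Hypothetical (image, label) pairs; strings stand in for the image arrays.
pairs = [(f"img{i}", random.randrange(10)) for i in range(10_000)]

# Both keys select the label, so both sorts give the same order.
by_lambda = sorted(pairs, key=lambda pair: pair[1])
by_itemgetter = sorted(pairs, key=itemgetter(1))

t_lambda = timeit.timeit(
    lambda: sorted(pairs, key=lambda pair: pair[1]), number=20)
t_itemgetter = timeit.timeit(
    lambda: sorted(pairs, key=itemgetter(1)), number=20)

print(f"lambda: {t_lambda:.4f}s  itemgetter: {t_itemgetter:.4f}s")
```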
