4

I have a 2D numpy array with about 12 columns and 1000+ rows, and each cell contains a number from 1 to 5. I'm searching for the best sextuple of columns according to my point system, where 1 and 2 give -1 point and 4 and 5 give +1.

If a row in a certain sextuple contains, for example, [1, 4, 5, 3, 4, 3], the score for this row should be +2, because 3*1 + 1*(-1) = 2. The next row may be [1, 2, 2, 3, 3, 3] and should score -3 points.
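The scoring rule described above can be sketched as a lookup table indexed by the cell value. The names `POINTS` and `row_score` are illustrative and not part of the original post; the sketch assumes a 3 scores 0, as the two examples imply.

```python
import numpy as np

# Lookup table indexed by cell value: index 0 is unused padding,
# 1 and 2 score -1, 3 scores 0, 4 and 5 score +1.
POINTS = np.array([0, -1, -1, 0, 1, 1])

def row_score(row):
    """Return the total points for one row of values in 1..5."""
    return int(POINTS[np.asarray(row)].sum())

print(row_score([1, 4, 5, 3, 4, 3]))  # 2
print(row_score([1, 2, 2, 3, 3, 3]))  # -3
```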

At first, I tried a straightforward loop solution, but I realized there are 665 280 possible combinations of columns to compare, and since I also need to search for the best quintuple, quadruple etc., the loop is taking forever.

Is there perhaps a smarter numpy-way of solving my problem?

onyx
  • Can you post your loop solution? Sometimes it's easier to optimize already working code, rather than trying to re-invent the wheel ... – mgilson Sep 05 '12 at 14:43
  • Another advantage to posting your solution is that it resolves ambiguities. For example, I'm not sure if you want to find the six columns which give the maximum total if you sum over those columns for each row (which is very easy) or something else. – DSM Sep 05 '12 at 14:51
  • It might also help to know more about your dataset. For example, it sounds like you're willing to accept any six answers from one row- if each row is one observation, why can the rest be rejected? Can your data array be restructured in some way to simplify the search space? – abought Sep 05 '12 at 14:52
  • I can't think of an interpretation of this question where the result of searching through the combinations is ever going to be different from sorting and taking the largest N. – chthonicdaemon Sep 05 '12 at 16:52
  • are you looking for the combination of columns that maximizes just a single row or the whole matrix? – Daniel Sep 05 '12 at 23:07

3 Answers

1
import numpy as np
import itertools

N_rows = 10
arr = np.random.randint(1, 6, size=(N_rows,12))  # values 1..5; random_integers is deprecated
x = np.array([0,-1,-1,0,1,1])
y = x[arr]

print(y)

score, best_sextuple = max((y[:,cols].sum(), cols)
                           for cols in itertools.combinations(range(12),6))
print('''\
score: {s}
sextuple: {c}
'''.format(s = score, c = best_sextuple))

yields, for example,

score: 6
sextuple: (0, 1, 5, 8, 10, 11)

Explanation:

First, let's generate a random example, with 12 columns and 10 rows:

N_rows = 10
arr = np.random.randint(1, 6, size=(N_rows,12))  # values 1..5; random_integers is deprecated

Now we can use numpy indexing to convert the numbers 1, 2, ..., 5 in arr to the values -1, 0, 1 (according to your scoring system). Index 0 of x is unused padding, since arr never contains 0:

x = np.array([0,-1,-1,0,1,1])
y = x[arr]

Next, let's use itertools.combinations to generate all possible combinations of 6 columns:

for cols in itertools.combinations(range(12),6)

and

y[:,cols].sum()

then gives the score for cols, a choice of columns (sextuple).

Finally, use max to pick off the sextuple with the best score:

score, best_sextuple = max((y[:,cols].sum(), cols)
                           for cols in itertools.combinations(range(12),6))
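
Because the score of a column set is a sum over every row and every chosen column, it can be computed from per-column totals that only need to be calculated once. The sketch below is a variation on the code above rather than the original answer's method: it reuses those totals to search sextuples, quintuples and quadruples. `best_subset` and `col_totals` are names introduced for illustration. Note that `itertools.combinations(range(12), 6)` yields 924 unordered sextuples; the 665 280 figure in the question counts ordered selections (12·11·10·9·8·7).

```python
import itertools
import numpy as np

def best_subset(col_totals, k):
    """Return (score, columns) for the best k-column combination.

    col_totals holds the summed points of each column, so scoring a
    combination only adds k numbers instead of re-summing the array.
    """
    n_cols = len(col_totals)
    return max(
        (int(sum(col_totals[c] for c in cols)), cols)
        for cols in itertools.combinations(range(n_cols), k)
    )

rng = np.random.default_rng(0)             # seeded only for repeatability
arr = rng.integers(1, 6, size=(1000, 12))  # values 1..5, as in the question
y = np.array([0, -1, -1, 0, 1, 1])[arr]    # map values to -1, 0, +1
col_totals = y.sum(axis=0)                 # one total per column

for k in (6, 5, 4):
    score, cols = best_subset(col_totals, k)
    print(k, score, cols)
```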
unutbu
1
import numpy

A = numpy.random.randint(1, 6, size=(1000, 12))
points = -1*(A == 1) + -1*(A == 2) + 1*(A == 4) + 1*(A == 5)
columnsums = numpy.sum(points, 0)

def best6(row):
    return numpy.argsort(row)[-6:]

bestcolumns = best6(columnsums)
allbestcolumns = list(map(best6, points))  # list() so the result is materialized on Python 3

bestcolumns will now contain the indices of the best 6 columns, in ascending order of score. By similar logic, allbestcolumns will contain the best six columns in each row.
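
As a sketch of the same idea with the subset size as a parameter (the helper name `best` and its `k` argument are illustrative), `numpy.argsort` can also be applied along the last axis to handle every row in one call instead of mapping over rows.

```python
import numpy

def best(scores, k=6):
    """Indices of the k largest entries along the last axis, ascending by score."""
    return numpy.argsort(scores, axis=-1)[..., -k:]

A = numpy.random.randint(1, 6, size=(1000, 12))
points = -1 * (A <= 2) + 1 * (A >= 4)         # 1, 2 -> -1; 3 -> 0; 4, 5 -> +1

bestcolumns = best(points.sum(axis=0), k=6)   # shape (6,)
allbestcolumns = best(points, k=6)            # shape (1000, 6), one row each
```

Calling `best(points.sum(axis=0), k=5)` or `k=4` gives the quintuple and quadruple under the same interpretation.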

chthonicdaemon
  • This is how I originally interpreted the question but other people have given an equally plausible reading. I would use `.argsort()[-6:]` instead, though. – DSM Sep 05 '12 at 15:58
  • I've changed it to `argsort`, but I'm a bit new here, so I'm not sure about the etiquette of incorporating suggestions like that in my answer. This comment serves as disclosure. – chthonicdaemon Sep 05 '12 at 16:17
  • Welcome to SO, then! If it's obvious from the comments, then there's no need to credit (unless it's really super-clever, but this was just a minor variation). Often higher-karma types will make comments on the answer closest in spirit to what they would write rather than writing their own, I've noticed. Aside: rather than using `best6`, though, I might use `best` and make the 6 a parameter. – DSM Sep 05 '12 at 16:22
0

Extending unutbu's longer answer above, it's possible to generate the array of scores automatically. Since the score for each value is the same on every pass through the loop, the scores only need to be calculated once. Here's a slightly inelegant way to do it on an example 6x10 array, shown before and after your scores are applied. (The example draws values from 0 to 5, so a 0 is also scored as -1.)

>>> import numpy
>>> values = numpy.random.randint(6, size=(6,10))
>>> values
array([[4, 5, 1, 2, 1, 4, 0, 1, 0, 4],
       [2, 5, 2, 2, 3, 1, 3, 5, 3, 1],
       [3, 3, 5, 4, 2, 1, 4, 0, 0, 1],
       [2, 4, 0, 0, 4, 1, 4, 0, 1, 0],
       [0, 4, 1, 2, 0, 3, 3, 5, 0, 1],
       [2, 3, 3, 4, 0, 1, 1, 1, 3, 2]])
>>> b = values.copy()
>>> b[ b<3 ] = -1
>>> b[ b==3 ] = 0
>>> b[ b>3 ] = 1
>>> b
array([[ 1,  1, -1, -1, -1,  1, -1, -1, -1,  1],
       [-1,  1, -1, -1,  0, -1,  0,  1,  0, -1],
       [ 0,  0,  1,  1, -1, -1,  1, -1, -1, -1],
       [-1,  1, -1, -1,  1, -1,  1, -1, -1, -1],
       [-1,  1, -1, -1, -1,  0,  0,  1, -1, -1],
       [-1,  0,  0,  1, -1, -1, -1, -1,  0, -1]])
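
For this particular point system, the same mapping can be written without three masked assignments, since the score is just the sign of `value - 3`. A minimal sketch, assuming integer values; `numpy.sign` is a standard NumPy function, and the variable name `scores` is illustrative.

```python
import numpy

values = numpy.random.randint(6, size=(6, 10))  # 0..5, as in the example above

# Values below 3 give -1, exactly 3 gives 0, above 3 gives +1.
scores = numpy.sign(values - 3)

# Cross-check against the masked-assignment version.
b = values.copy()
b[b < 3] = -1
b[b == 3] = 0
b[b > 3] = 1
assert (scores == b).all()
```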

Incidentally, this thread claims that creating the combinations directly within numpy will yield around 5x faster performance than itertools, though perhaps at the expense of some readability.
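
The linked thread is not reproduced here, so the following is only one possible way to evaluate all combinations with array operations, not necessarily the approach it describes. It still builds the index array with `itertools.combinations`, then scores every sextuple in a single indexing step; `combo_idx`, `combo_scores` and `col_totals` are illustrative names.

```python
import itertools
import numpy

rng = numpy.random.default_rng(1)             # seeded only for repeatability
values = rng.integers(1, 6, size=(1000, 12))  # values 1..5
col_totals = numpy.sign(values - 3).sum(axis=0)

# One row per combination: shape (924, 6) for 12 columns taken 6 at a time.
combo_idx = numpy.array(list(itertools.combinations(range(12), 6)))

# Fancy indexing gathers the six column totals of every combination at once.
combo_scores = col_totals[combo_idx].sum(axis=1)

best = combo_scores.argmax()
print(combo_scores[best], tuple(combo_idx[best]))
```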

Community
abought