0

I'm writing code to analyze outliers with KNN. The full distance matrix (70k x 70k) is too big for my RAM (36 GB), so I split it into 7 matrices of 10k x 10k elements with the following code:

matrices = []
for i in range(7):
    matrices.append(np.zeros([10000, 10000]))

for matrix in matrices:
    for i in range(10000 * matrices.index(matrix), 10000 * (matrices.index(matrix) + 1)):
        for j in range(10000 * matrices.index(matrix), 10000 * (matrices.index(matrix) + 1)):
            distance = mt.sqrt((x[i] - x[j]) ** 2 + (y[i] - y[j]) ** 2)
            matrix[i, j] = round(distance, 3)

But when I run it (after about 25 minutes, of course) it raises the following error:

ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()

The error points to the line `for i in range(10000 * matrices.index(matrix), 10000 * (matrices.index(matrix) + 1)):`. I can't find anything about this, since I'm not actually asking for a truth value.

(It's for homework and I don't have time to learn and use PyTables.)

  • In the end I split the matrix into 49 smaller matrices and analyzed them separately. Not the most optimized method, but it worked. Thanks for the answers!! – Matias Vidal Jul 18 '20 at 21:50

2 Answers

0

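`matrices` is a list, and `matrices.index(matrix)` tries to find the object `matrix` in that list. Effectively it compares `matrix` against each of the list's elements in turn, roughly `if matrix == element: return index`. However, since `matrix` is a NumPy array, this `==` comparison yields another boolean array, which can't be interpreted as a single `bool` value. The first array in the list is matched by identity before `==` is ever tried, which is why the error only shows up once the loop reaches the second matrix.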
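A minimal sketch of the behaviour described above, using tiny 3 x 3 arrays instead of 10k x 10k ones (the sizes and variable names here are illustrative assumptions, not taken from the question). It also shows `enumerate` as one possible way to get the position without calling `index`:

```python
import numpy as np

# Two small stand-in arrays (assumption: the real ones are 10k x 10k).
matrices = [np.zeros((3, 3)) for _ in range(2)]

# The first array is matched by identity, so no == comparison is needed.
first = matrices.index(matrices[0])

# Looking up the second array first compares it with matrices[0] using ==,
# which returns a boolean array; turning that into a single bool fails.
try:
    matrices.index(matrices[1])
    raised = False
except ValueError:
    raised = True

# enumerate yields the position directly, with no comparisons at all.
indices = [k for k, matrix in enumerate(matrices)]
```

With `for k, matrix in enumerate(matrices):` the loop bounds can be written as `range(10000 * k, 10000 * (k + 1))`.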

a_guest
  • So should I make the matrix a list? But how can I fill it with an element and then replace it? Or should I use `for i in range(7)` instead of the index? – Matias Vidal Jul 16 '20 at 23:15
  • @MatiasVidal I'm not sure what you are trying to do with that code. Could you elaborate? – a_guest Jul 16 '20 at 23:27
  • I need a matrix of distances between 70k points so I can compare them, apply the KNN method, and study the outliers in the data. – Matias Vidal Jul 16 '20 at 23:39
  • @MatiasVidal So you consider only subsets of 10k points each? That's not the same thing. – a_guest Jul 16 '20 at 23:44
0

Note that you have to divide a 70k x 70k matrix into 7 x 7 = 49 matrices of 10k x 10k each. Besides that, did you mean to do this?

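import math as mt
import numpy as np

# x and y are the coordinate arrays from the question
# 7 x 7 = 49 blocks of 10k x 10k, stored row by row
matrices = []
for i in range(7):
    for j in range(7):
        matrices.append(np.zeros([10000, 10000]))

for i in range(70000):
    for j in range(70000):
        distance = mt.sqrt((x[i] - x[j]) ** 2 + (y[i] - y[j]) ** 2)
        # block row is i // 10000, block column is j // 10000
        matrices[7 * (i // 10000) + j // 10000][i % 10000, j % 10000] = round(distance, 3)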

However, I don't see how this differs from fitting the entire matrix in RAM: the 49 blocks in `matrices` together hold the same 70k x 70k values, which is about 39 GB as float64. On top of that, NumPy has functions to chunk arrays into blocks and to compute Euclidean distances on whole arrays at once. You might want to leverage those.
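One way to stay within RAM is to never store the full matrix: compute a block of rows, reduce it to the statistic you need, and discard it. The sketch below assumes the quantity of interest for the outlier analysis is the distance from each point to its k-th nearest neighbor; that choice, the function name `kth_neighbor_distance`, and the `block` parameter are assumptions for illustration, not something stated in the question.

```python
import numpy as np

def kth_neighbor_distance(x, y, k=5, block=1000):
    """Distance from each point to its k-th nearest neighbor.

    Only a (block x n) slice of the distance matrix exists at any time.
    Requires k < len(x).
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    n = len(x)
    result = np.empty(n)
    for start in range(0, n, block):
        stop = min(start + block, n)
        # Pairwise distances between rows start..stop and all n points,
        # computed with broadcasting instead of Python loops.
        d = np.hypot(x[start:stop, None] - x[None, :],
                     y[start:stop, None] - y[None, :])
        # In sorted order, position 0 is the zero distance to the point
        # itself, so position k is the k-th nearest other point.
        result[start:stop] = np.partition(d, k, axis=1)[:, k]
    return result

# Small usage example with four points on a line.
demo = kth_neighbor_distance([0.0, 1.0, 3.0, 10.0], [0.0, 0.0, 0.0, 0.0], k=1, block=2)
```

With 70k points and `block=1000`, each slice is 1000 x 70000 float64 values, roughly 560 MB, plus a few temporaries of the same size. Points with an unusually large k-th neighbor distance are then candidates for outliers.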

Ehsan
  • Oh, you're right, it should be 49 matrices. What functions are you talking about? I don't know them :( – Matias Vidal Jul 16 '20 at 23:38
  • To split an array into blocks, use either `reshape` or `view_as_windows` from scikit-image. To calculate distances (there are many more options), use `np.linalg.norm(p - p[:, None], axis=-1)`, where `p` is an (n, 2) array stacking x and y, or, even faster, the functions in `scipy.spatial.distance`. Check out https://stackoverflow.com/questions/1401712/how-can-the-euclidean-distance-be-calculated-with-numpy for the different distance functions available. Nonetheless, I do not see how this way of chunking is going to let you work in RAM piece by piece. You might need to chunk the matrix and save it to file so that you can load it piece by piece for your calculations. – Ehsan Jul 16 '20 at 23:45