2

So scikit-learn's DBSCAN takes in sparse matrices, and if the matrix isn't of csr_matrix format, converts it to such. I'd like to parse in a csr_matrix, but then I get this warning:

EfficiencyWarning: Precomputed sparse input was not sorted by data.

How do I create a data-sorted csr_matrix? If I initialize the matrix data-sorted, the matrix automatically index-sorts it:

>>> from scipy.sparse import csr_matrix
>>> x = csr_matrix(([1,2,3],[[3,2,1],[5,2,1]]))
>>> print(x)
  (1, 1)    3
  (2, 2)    2
  (3, 5)    1

I know csr_matrix has a has_sorted_indices flag, but I'm not sure how to use it. Even if I set it to false, the matrix is still sorted by indices.

Edited: I tried sorted_indices but it doesn't seem to change anything. I'm not sure if my concept of sorted_indices is correct? Is it supposed to sort the data from low to high per row?

>>> from scipy.sparse import csr_matrix
>>> x = csr_matrix(([7,3,5,1,6,2], [[0,1,2,0,1,2],[0,0,0,1,1,1]]), shape=(3, 2))
>>> print(x)
  (0, 0)    7
  (0, 1)    1
  (1, 0)    3
  (1, 1)    6
  (2, 0)    5
  (2, 1)    2
>>> x.has_sorted_indices = False
>>> x.sort_indices()
>>> print(x)
  (0, 0)    7
  (0, 1)    1
  (1, 0)    3
  (1, 1)    6
  (2, 0)    5
  (2, 1)    2

What I want (is this possible or no?)

  (0, 1)    1
  (0, 0)    7
  (1, 0)    3
  (1, 1)    6
  (2, 1)    2
  (2, 0)    5

Basically I need this to return True:

out_of_order = graph.data[:-1] > graph.data[1:]
line_change = np.unique(graph.indptr[1:-1] - 1)
line_change = line_change[line_change < out_of_order.shape[0]]
return (out_of_order.sum() == out_of_order[line_change].sum())
narcissa
  • 309
  • 1
  • 8
  • I think you need to find out more about this `sorted by data` requirement. There's nothing in the `scipy.sparse` that will do this for you. I'm not even sure what it means. The `sorted_indices` has to do with the column indices of the nonzero values - sort within each row. – hpaulj Sep 04 '20 at 00:04
  • @hpaulj thanks for the response. Actually, I looked into what's throwing that error message and that's exactly what I need -- the column indices need to be sorted within each row (by value)... how do i get it to be that way? – narcissa Sep 04 '20 at 01:55
  • `sort_indices` - https://docs.scipy.org/doc/scipy/reference/generated/scipy.sparse.csr_matrix.sort_indices.html#scipy.sparse.csr_matrix.sort_indices – hpaulj Sep 04 '20 at 02:19
  • @hpaulj I updated the question on what I'm looking for. I might be completely misunderstanding what `sort_indices` does. I'm looking to sort elements within each row by their values. Is this doable or am I completely missing the point here? Thanks! – narcissa Sep 04 '20 at 05:09
  • The method is called `sort_indices` not `sort` or `sort_data`! – hpaulj Sep 04 '20 at 06:30
  • @hpaulj So there's no way to sort data? – narcissa Sep 04 '20 at 17:09

1 Answers1

0

DBSCAN uses the nearest neighbors code, and this warning is logged here. Obviously, if you want to figure out the nearest samples of another sample, that can be done efficiently if they are sorted. DBSCAN has several (two probably) steps where it needs to look into the neighborhood of one sample, so this ordering makes sense.

A few lines lower you'll find the code to sort sparse matrices accordingly. Unfortunately, it's not wrapped as function so you'd need to 'functionify' it yourself if you want to prevent the warning. This only makes sense if you mind this warning and

  • you want to run DBSCAN more often on the same distance matrix, or
  • you want to avoid silencing warning, just in case DBSCAN will throw relevant other EfficiencyWarnings in the future.

The downside is that you have this nastily low-level function to maintain in your code base. You could try to get a PR accepted pulling it out as a function in scikit-learn if you're patient.

Alternatively silence the warning as per this or ignore it.

The function works in-place, so uncomment graph = graph.copy() if you don't want that.

def sort_values(graph):
  # graph = graph.copy()
  row_nnz = np.diff(graph.indptr)
  if row_nnz.max() == row_nnz.min():
    n_samples = graph.shape[0]
    distances = graph.data.reshape(n_samples, -1)

    order = np.argsort(distances, kind="mergesort")
    order += np.arange(n_samples)[:, None] * row_nnz[0]
    order = order.ravel()
    graph.data = graph.data[order]
    graph.indices = graph.indices[order]
  else:
    for start, stop in zip(graph.indptr, graph.indptr[1:]):
      order = np.argsort(graph.data[start:stop], kind="mergesort")
      graph.data[start:stop] = graph.data[start:stop][order]
      graph.indices[start:stop] = graph.indices[start:stop][order]
  return graph


sort_values(sparse_distance_matrix)

Alternatively silence the warning as per this or ignore it.

If you get specific warnings like this, my tip is to search the appropriate github or even google for this, particularly using quotes to on several unique words literally find code containing that string. Once you've located the code, you can reverse engineer, monkey-patch debug print statements to your local, trash-able virtualenv's version.

Herbert
  • 5,279
  • 5
  • 44
  • 69