DBSCAN with custom metric

Question

I have the following:

a dataset with thousands of data points
a way of computing the similarity between data points, but I cannot plot the points themselves in Euclidean space

I know that DBSCAN should support a custom distance metric, but I don't know how to use it.

Say I have a function

def similarity(x,y):
    return  similarity ...

and I have a list of data that can be passed pairwise into that function. How do I specify this when using the DBSCAN implementation of scikit-learn?

Ideally, I want to get a list of the clusters, but I can't figure out how to get started in the first place.

There is a lot of terminology that still confuses me:

http://scikit-learn.org/stable/modules/generated/sklearn.cluster.DBSCAN.html

How do I pass a feature array, and what is it? How do I fit this implementation to my needs? How will I be able to get my "sublists" from this algorithm?

j4nw · Accepted Answer · 2018-12-05T07:35:50.493

12

A "feature array" is simply an array of the features of a datapoint in your dataset.

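metric is the parameter you're looking for. It can be a string (the name of a builtin metric) or a callable. Your similarity function is a callable. This isn't well described in the documentation, but a metric has to do just that: take two datapoints as parameters and return a number. Note that DBSCAN treats that number as a distance (smaller means closer), so a similarity score has to be converted first, for example as 1 - similarity if it is normalized to [0, 1].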

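import sklearn.cluster

def similarity(x, y):
    # must return a distance: smaller values mean the points are closer
    return ...

# fit() returns the fitted estimator; the cluster labels are in labels_
model = sklearn.cluster.DBSCAN(metric=similarity).fit(dataset)
labels = model.labels_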
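A minimal, runnable sketch of the callable-metric approach. The similarity function, the toy data, and the eps and min_samples values are all assumptions made up for illustration. Because DBSCAN treats the metric's return value as a distance, the similarity (assumed here to lie in (0, 1]) is inverted before being passed in.

```python
import numpy as np
from sklearn.cluster import DBSCAN

def similarity(x, y):
    # Hypothetical similarity in (0, 1]: 1.0 for identical points,
    # approaching 0 as the points move apart.
    return 1.0 / (1.0 + np.abs(x - y).sum())

def distance(x, y):
    # DBSCAN expects a distance (small = close), so invert the similarity.
    return 1.0 - similarity(x, y)

# Toy data (assumed): two tight groups and one isolated point.
dataset = np.array([[0.0], [0.1], [0.2], [5.0], [5.1], [5.2], [20.0]])

# eps is the neighbourhood radius measured in units of distance();
# both values below are illustrative, not recommendations.
model = DBSCAN(eps=0.3, min_samples=2, metric=distance).fit(dataset)

# One label per row of dataset, in the same order; -1 marks noise.
labels = model.labels_
print(labels)
```

If the custom function is not a true metric (for example, it violates the triangle inequality), passing algorithm="brute" to DBSCAN is probably the safer choice, since tree-based neighbour searches rely on metric properties.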

edited Dec 05 '18 at 07:35

answered Feb 13 '18 at 13:48


Thanks for the understandable answer. One more question: what will the algorithm return? Will I have to iterate over the whole array again to get a label for each item, or how does this work? – zython Feb 13 '18 at 13:57
2

fit() returns the fitted DBSCAN object; its labels_ attribute is a 1-D numpy array with one cluster label per row of the dataset, in the same order (-1 marks noise). If your dataset has labels as the first column, you'd extract these first. Look at pandas dataframes - you can easily use them to split datasets into labels and raw numbers/datapoints. – j4nw Feb 13 '18 at 14:02
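To get the "sublists" asked about in the question, the per-item labels can be grouped in a single pass. This is a sketch only: group_by_label is a hypothetical helper, and the items and labels below are made-up stand-ins for your data and for the labels_ array of a fitted DBSCAN.

```python
from collections import defaultdict

def group_by_label(items, labels):
    # Hypothetical helper: collect items that share a cluster label.
    clusters = defaultdict(list)
    for item, label in zip(items, labels):
        # int() turns numpy integer labels into plain Python keys.
        clusters[int(label)].append(item)
    return dict(clusters)

# Made-up example: in practice, labels would be model.labels_.
items = ["a", "b", "c", "d", "e"]
labels = [0, 0, 1, -1, 1]

clusters = group_by_label(items, labels)
# The list stored under key -1 holds the points DBSCAN marked as noise.
print(clusters)
```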

score 6 · Answer 2 · answered Mar 29 '18 at 13:44

In case someone is looking for the same thing for strings with a custom metric:

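    import numpy as np
    from sklearn.cluster import DBSCAN

    def metric(x, y):
        # x and y are one-element rows holding indices into string_seqs
        return yourDistFunc(string_seqs[int(x[0])], string_seqs[int(y[0])])

    def clusterPockets():
        global string_seqs
        string_seqs = load_data()  # ["foo", "bar", ...]
        # DBSCAN needs numeric input, so cluster the indices, not the strings
        dat = np.arange(len(string_seqs)).reshape(-1, 1)
        clustered_dataset = DBSCAN(metric=metric).fit(dat)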
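An alternative to the index trick, for data that cannot be represented as vectors, is to compute the full pairwise distance matrix yourself and pass metric="precomputed". The sketch below assumes a string distance built from difflib's SequenceMatcher ratio (a stand-in for yourDistFunc) plus made-up strings and parameters. For a few thousand items the n-by-n matrix should still fit comfortably in memory.

```python
from difflib import SequenceMatcher

import numpy as np
from sklearn.cluster import DBSCAN

def string_distance(a, b):
    # Assumed stand-in for a real string metric: ratio() is a similarity
    # in [0, 1], so 1 - ratio behaves like a distance.
    return 1.0 - SequenceMatcher(None, a, b).ratio()

# Made-up example data.
strings = ["apple", "apples", "appel", "banana", "bananas", "kiwi"]

# Build a symmetric distance matrix with zeros on the diagonal.
n = len(strings)
dist = np.zeros((n, n))
for i in range(n):
    for j in range(i + 1, n):
        dist[i, j] = dist[j, i] = string_distance(strings[i], strings[j])

# With metric="precomputed", fit() takes the distance matrix itself.
# eps and min_samples are illustrative values only.
labels = DBSCAN(eps=0.3, min_samples=2, metric="precomputed").fit(dist).labels_

for s, label in zip(strings, labels):
    print(label, s)
```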
