
I have a dataset with 2 parameters, looking like this (I have added density contour plots):

enter image description here

My goal is to separate this sample into 2 subsets like this:

enter image description here

This image comes from "Quenching of Star Formation in SDSS Groups: Centrals, Satellites, and Galactic Conformity", Knobel et al., The Astrophysical Journal, 800:24 (20pp), 2015 February 1, available here. The separation line has been drawn by eye and is not perfect.

What I need is something like the red line (maximizing distances) in this nice Wikipedia graph:

enter image description here

Unfortunately, all the linear classifiers that seem close to what I'm looking for (SVM, SVC, etc.) are supervised learning methods.

I have tried unsupervised learning, such as KMeans with 2 clusters, this way (CompactSFR[['lgm_tot_p50','sSFR']] being the Pandas dataset you can find at the end of this post):

import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

# The two features to cluster on
X = CompactSFR[['lgm_tot_p50', 'sSFR']]

kmeans2 = KMeans(n_clusters=2)
# Fitting the input data
kmeans2 = kmeans2.fit(X)
# Getting the cluster labels
labels2 = kmeans2.predict(X)
# Centroid values
centroids = kmeans2.cluster_centers_
# Distance of each point to each centroid (not used in the plot below)
X2 = kmeans2.transform(X)

f, (ax1, ax2) = plt.subplots(nrows=1, ncols=2, figsize=(10, 5), sharey=True)
ax1.scatter(CompactSFR['lgm_tot_p50'], CompactSFR['sSFR'], c=labels2)
ax1.set_title("Kmeans 2 clusters", fontsize=15)
# Raw string so that \l is not treated as an escape sequence
ax1.set_xlabel(r'$\log_{10}(M)$', fontsize=10)
ax1.set_ylabel('sSFR', fontsize=10)
f.subplots_adjust(hspace=0)

but the classification I get is this:

enter image description here

This is not the separation I am looking for.

Furthermore, what I want is not a simple classification but the equation of the separation line (which is obviously very different from a linear regression).

I would like to avoid developing a Bayesian model of maximum likelihood if something already exists.

You can find a small sample (959 points) here.

NB: this question doesn't correspond to my case.

desertnaut
Matt
  • Have you tried a mixture model fitted with the EM algorithm to separate the regions, followed by an SVM to find the line equation? Turning the problem from unsupervised into supervised may give a better separation (you are looking for a specific separation, so you are in effect supervising the training). Everything is available in sklearn. – Frayal Mar 13 '19 at 13:52
  • I have not tried Expectation-maximization, I'll have a look right now. – Matt Mar 13 '19 at 13:57

1 Answer

The following code does this with a Gaussian mixture model of 2 components and produces this result:

result figure

First, read the data from your file and remove outliers:

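import pandas as pd
import numpy as np
from sklearn.neighbors import KernelDensity

# FILE is the path to the CSV sample linked in the question
frm = pd.read_csv(FILE, index_col=0)
# Estimate the local density around each point with a Gaussian kernel
kd = KernelDensity(kernel='gaussian')
kd.fit(frm.values)
# score_samples returns the log-density, hence the exponential
density = np.exp(kd.score_samples(frm.values))
# Keep only the points lying in sufficiently dense regions
filtered = frm.values[density > 0.05, :]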

Then fit a Gaussian Mixture Model:

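from sklearn.mixture import GaussianMixture

# Two components, each with its own full covariance matrix
model = GaussianMixture(n_components=2, covariance_type='full')
model.fit(filtered)
# Component (0 or 1) assigned to each point
cl = model.predict(filtered)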
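
With covariance_type='full' the boundary between the two components is in general a curve, not a straight line. If an explicit line equation is needed, one possible variation (not part of the code above) is to use covariance_type='tied': both components then share one covariance matrix and the boundary is linear. The sketch below runs on synthetic stand-in data, and every name in it (X, tied, w, b, slope, intercept) is introduced here for illustration only.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Synthetic stand-in for the filtered data (an assumption, not the real sample):
# two clouds that share the same covariance matrix.
rng = np.random.default_rng(0)
shared_cov = [[0.3, 0.1], [0.1, 0.1]]
X = np.vstack([
    rng.multivariate_normal([10.0, -10.0], shared_cov, size=500),
    rng.multivariate_normal([11.0, -12.0], shared_cov, size=500),
])

# 'tied' forces both components to share a single covariance matrix,
# which makes the boundary between them a straight line.
tied = GaussianMixture(n_components=2, covariance_type='tied',
                       n_init=5, random_state=0).fit(X)

mu0, mu1 = tied.means_
pi0, pi1 = tied.weights_
sigma_inv = np.linalg.inv(tied.covariances_)  # a single (2, 2) matrix when tied

# The log posterior ratio of component 1 versus component 0 is w @ x + b.
w = sigma_inv @ (mu1 - mu0)
b = -0.5 * (mu1 + mu0) @ w + np.log(pi1 / pi0)

# Boundary w[0]*x + w[1]*y + b = 0, rewritten as y = slope * x + intercept
# (only valid when w[1] is not zero, i.e. the line is not vertical).
slope = -w[0] / w[1]
intercept = -b / w[1]
print(f"y = {slope:.3f} * x + {intercept:.3f}")
```

Points with w @ x + b > 0 are assigned to component 1 and the others to component 0, so the sign of that expression should reproduce tied.predict. Whether a shared covariance is a reasonable assumption for the real data has to be checked on the data itself.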

To obtain the plot:

import matplotlib.pyplot as plt
# One colour per mixture component
plt.scatter(filtered[cl==0,0], filtered[cl==0,1], color='Blue')
plt.scatter(filtered[cl==1,0], filtered[cl==1,1], color='Red')
plt.show()
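
The question also asks for the equation of the separating line. One way to get it, following the suggestion in the comments under the question, is to treat the mixture labels as targets and fit a linear SVM on them. This is only a sketch on synthetic stand-in data; pts, gmm, labels, svm and the coefficient names are hypothetical and do not come from the original code.

```python
import numpy as np
from sklearn.mixture import GaussianMixture
from sklearn.svm import SVC

# Synthetic stand-in for the filtered data (an assumption, not the real sample).
rng = np.random.default_rng(1)
pts = np.vstack([
    rng.multivariate_normal([10.0, -10.0], [[0.3, 0.1], [0.1, 0.1]], size=300),
    rng.multivariate_normal([11.0, -12.0], [[0.3, 0.1], [0.1, 0.2]], size=300),
])

# Step 1 (unsupervised): label the points with a 2-component mixture.
gmm = GaussianMixture(n_components=2, covariance_type='full',
                      n_init=5, random_state=0)
labels = gmm.fit_predict(pts)

# Step 2 (supervised): fit a maximum-margin straight line to those labels.
svm = SVC(kernel='linear', C=1.0).fit(pts, labels)
a, c = svm.coef_[0]      # coefficients of x and y
d = svm.intercept_[0]    # constant term

# Boundary a*x + c*y + d = 0, rewritten as y = slope * x + intercept
# (only valid when c is not zero, i.e. the line is not vertical).
slope = -a / c
intercept = -d / c
print(f"y = {slope:.3f} * x + {intercept:.3f}")
```

The line is only as good as the labels produced in step 1, so it is worth plotting it over the scatter plot before relying on it.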
JARS