-2

I want to group the numbers in different regions automatically. As per the figure (sample dataset), we can see that their are three regions in which numbers are lying i.e. [0,100], [650,750], [1220, 1300]. I want to point those regions only. There can be any number of such regions. We need to automatically find the no. of such regions and range of such regions. The distance between two regions will be considerably very high. Is there any way I can do this in Python?

Sample data = [69,  8,  30, 45, 89, 61, 80, 45, 9,  18, 19, 11, 1255,   1299,   1296,   1293,   1287,   1250,   1265,   1291,   1281,   1250,   1286,   1286,   1251,   1287,   1266,   1288,   1254,   1260,   1260,   1254,   1267,   1299,   1273,   1250,   1300,   1250,   1279,   1255,   1293,   1292,   1278,   1277,   1252,   1299,   1278,   1258,   1268,   1274,   1285,   1258,   1279,   1270,   1278,   1286,   1278,   1253,   1267,   1300,   1295,   1298,   1285,   1288,   1274,   1272,   1252,   1256,   1283,   1289,   1251,   1258,   1253,   1257,   1297,   1269,   1292,   1253,   1273,   1281,   1251,   1280,   1253,   1274,   1275,   1287,   1296,   1298,   1296,   1291,   1284,   1261,   1267,   1290,   1273,   1281,   1263,   1270,   1264,   1269,   1278,   1284,   67, 8,  40, 59, 97, 64, 45, 72, 45, 90, 94, 7,  33, 58, 97, 97, 1252,   1297,   1265,   1278,   1272,   1252,   1258,   1261,   1287,   1260,   1260,   1258,   1280,   1263,   1256,   1296,   1269,   1270,   1296,   1282,   696,    678,    665,    700,    700,    691,    689,    688,    650,    663,    662,    698,    655,    660,    662,    684,    690,    657,    653,    663,    670,    691,    687,    675,    694,    670,    676,    659,    661,    664,    664,    689,    683,    675,    687,    691,    676,    659,    689,    657,    659,    656,    654,    679,    669,    687,    666,    662,    691,    1260,   1276,   1252,   1295,   1257,   1277,   1281,   1257,   1295,   1269,   1265,   1290,   1266,   1269,   1286,   1254,   1260,   1265,   1290,   1294,   1286,   1279,   1254,   1256,   1276,   1285,   1282,   1251,   1282,   1261,   1253,   56, 74, 85, 94, 18, 83, 38, 80, 8,  4,  78, 43, 7,  79, 68, 78, 1275,   1250,   1268,   1297,   1284,   1255,   1294,   1262,   1250,   1252,   680,    693,    677,    676,    670,    653,    670,    661,    658,    695,    665,    671,    656,    686,    662,    691,    675,    658,    671,    650,    667,    653,    652,    686,    667,    682,    694,    654,    689,    682,    667,    658,    651,    652,    692,    652,    655,    651,    650,    698,    655,    650,    679,    672,    697,    696,    696,    683,    1277,   1264,   1274,   1260,   1285,   1285,   1283,   1259,   1260,   1288,   1281,   1284,   1281,   1257,   1285,   1295,   1273,   1264,   1283,   1284,   1300,   1299,   1257,   1297,   1254,   1257,   1270,   1257,   1295,   34, 5,  73, 42, 27, 36, 91, 85, 19, 50, 34, 21, 73, 38, 18, 73]
ThePyGuy
  • 17,779
  • 5
  • 18
  • 45
Ashish
  • 4,206
  • 16
  • 45
  • What output do you want in this case ? Can you give a data sample (not an image) ? – B. M. Feb 23 '17 at 13:26
  • If possible, I want this as output: [[0,100], [650,750], [1220, 1300]]. These are the ranges in which all data is lying.Basically there is a large list of data, out of which few nos. Lie in range 0,100 then there is a large gap and then few nos. Lie between 650 to 750 and similarly after a large gap few data in between1220 to 1300. – Ashish Feb 23 '17 at 13:33

3 Answers3

1

I referred Unsupervised clustering with unknown number of clusters as suggested by @schwobaseggl and changed the code a bit as per my need. Here is the new code:

import numpy
import scipy.cluster.hierarchy as hcluster

temp_data = [31,68,74,46,47,83,29,11,9,52,1272,1288,1297,1285,1294,1251,1265,1257,1280,1265,1292,1297,1271,1273,1253,1273,1291,1251,1295,1298,1264,1281,1294,1280,1250,1279,1298,1290,1294,1299,1266,1260,1298,1292,1280,1259,1266,1276,1253,1252,1271,1280,1284,1266,1254,1259,1291,1268,1253,1298,1288,1271,1298,1300,1274,1294,1263,1298,1270,1254,1266,1269,1283,1285,1286,1276,1257,1266,1272,1298,1261,1251,1272,1260,1291,1269,1260,1294,1287,1256,1253,1284,1269,1287,1292,1269,1272,1275,1250,1289,56,35,19,80,47,22,92,8,10,24,87,76,60,63,64,0,1295,1268,1280,1281,1277,1300,1278,1273,1250,1296,1266,1269,1282,1281,1272,1260,1292,1272,1253,1255,1299,1269,1268,1294,1250,1299,1292,1254,1281,1289,1259,1290,1271,1280,1272,1300,1258,1290,1289,1300,1299,1261,1300,1276,1290,1299,1280,1267,1283,1282,1269,1260,1285,1252,1250,1263,1297,1300,1292,1266,1260,1263,1292,1296,1289,1297,1251,1261,1250,1294,1278,1284,1291,1281,1269,1261,1257,1267,1265,1288,1291,1257,1296,1251,1260,1272,1294,1285,1269,1283,1297,1287,1253,1292,1299,1295,1286,1288,1283,1290,20,73,81,6,49,88,96,61,49,94,57,16,61,16,17,19,1280,1257,1259,1277,1257,1262,1263,1280,1292,1250,1287,1272,1258,1253,1285,1285,1257,1291,1273,1260,1267,1250,1280,1281,1263,1269,1292,1250,1282,1263,1274,1288,1296,1266,1291,1271,1273,1281,1261,1289,1269,1287,1296,1283,1280,1298,1259,1270,1259,1289,1269,1284,1295,1297,1256,1300,1281,1296,1284,1288,1285,1296,1277,1251,1279,1295,1281,1264,1280,1263,69,8,30,45,89,61,80,45,9,18,19,11,1255,1299,1296,1293,1287,1250,1265,1291,1281,1250,1286,1286,1251,1287,1266,1288,1254,1260,1260,1254,1267,1299,1273,1250,1300,1250,1279,1255,1293,1292,1278,1277,1252,1299,1278,1258,1268,1274,1285,1258,1279,1270,1278,1286,1278,1253,1267,1300,1295,1298,1285,1288,1274,1272,1252,1256,1283,1289,1251,1258,1253,1257,1297,1269,1292,1253,1273,1281,1251,1280,1253,1274,1275,1287,1296,1298,1296,1291,1284,1261,1267,1290,1273,1281,1263,1270,1264,1269,1278,1284,67,8,40,59,97,64,45,72,45,90,94,7,33,58,97,97,1252,1297,1265,1278,1272,1252,1258,1261,1287,1260,1260,1258,1280,1263,1256,1296,1269,1270,1296,1282,696,678,665,700,700,691,689,688,650,663,662,698,655,660,662,684,690,657,653,663,670,691,687,675,694,670,676,659,661,664,664,689,683,675,687,691,676,659,689,657,659,656,654,679,669,687,666,662,691,1260,1276,1252,1295,1257,1277,1281,1257,1295,1269,1265,1290,1266,1269,1286,1254,1260,1265,1290,1294,1286,1279,1254,1256,1276,1285,1282,1251,1282,1261,1253,56,74,85,94,18,83,38,80,8,4,78,43,7,79,68,78,1275,1250,1268,1297,1284,1255,1294,1262,1250,1252,680,693,677,676,670,653,670,661,658,695,665,671,656,686,662,691,675,658,671,650,667,653,652,686,667,682,694,654,689,682,667,658,651,652,692,652,655,651,650,698,655,650,679,672,697,696,696,683,1277,1264,1274,1260,1285,1285,1283,1259,1260,1288,1281,1284,1281,1257,1285,1295,1273,1264,1283,1284,1300,1299,1257,1297,1254,1257,1270,1257,1295,34,5,73,42,27,36,91,85,19,50,34,21,73,38,18,73]

ndata = [[td, td] for td in temp_data]
data = numpy.array(ndata)

# clustering
thresh = (11.0/100.0) * (max(temp_data) - min(temp_data))  #Threshold 11% of the total range of data

clusters = hcluster.fclusterdata(data, thresh, criterion="distance")

total_clusters = max(clusters)

clustered_index = []
for i in range(total_clusters):
    clustered_index.append([])

for i in range(len(clusters)):
    clustered_index[clusters[i] - 1].append(i)

clustered_range = []
for x in clustered_index:
    clustered_index_x = [temp_data[y] for y in x]
    clustered_range.append((min(clustered_index_x) , max(clustered_index_x)))

print clustered_range

i have chosen threshold value (thres) as 11% of the total range of data

so the output for this dataset is:

[(0, 97), (1250, 1300), (650, 700)]
Community
  • 1
  • 1
Ashish
  • 4,206
  • 16
  • 45
0

One possibility would be to use sklearns Gaussian Mixture Model to find the clusters. Maybe it is a bit of a overkill as there is only one dimension, but it should work anyhow.

Here is an example of how it could be done:

import numpy as np
from sklearn import mixture

# Generate some example data
centers = [50, 700, 1250]  # Approximate
y = []
for i in range(20):
    y.append(np.random.randn(100) * 50 + centers[np.random.randint(len(centers))])
y = np.c_[np.concatenate(y)]

N = 3  # Clusters
gmix = mixture.GaussianMixture(n_components=3, covariance_type='full')
gmix.fit(y)  # Now it thinks it is trained

for i in range(N):
    center = gmix.means_[i][0]
    std = np.sqrt(gmix.covariances_[i][0])
    print "Range %d: %.0f to %.0f" % (i + 1, center - std, center + std)

And it should output something like (using IPython 2.7):

Range 1: 1199 to 1298
Range 2: 649 to 752
Range 3: 4 to 102

It will of course depend on the clusters it finds, which might not always be a nice solution.

J. P. Petersen
  • 4,871
  • 4
  • 33
  • 33
  • but you have predefined **N = 3 # Clusters**. But we won't have any such information in the beginning. We need to automatically find how many such regions are there in the data set. – Ashish Feb 24 '17 at 06:18
  • Well you modified you question after I had given an answer, so it would be a bit hard for me to know that you also would like to find the number of clusters. But you could for example use the Bayesian information criterion (BIC) score to determine that. – J. P. Petersen Feb 24 '17 at 11:47
0

You may be looking for k-means clustering.

from typing import Tuple, Iterable, Sequence, List, Dict, DefaultDict
from random import sample
from math import fsum, sqrt
from collections import defaultdict
from functools import partial

Point = Tuple[int, ...]
Centroid = Point

def mean(data: Iterable[float]) -> float:
    'Accurate arithmetic mean'
    data = list(data)
    return fsum(data) / len(data)

def dist(p: Point, q: Point, sqrt=sqrt, fsum=fsum, zip=zip) -> float:
    'Euclidean distance'
    return sqrt(fsum((x - y) ** 2.0 for x, y in zip(p, q)))

def assign_data(centroids: Sequence[Centroid], data: Iterable[Point]) -> Dict[Centroid, Sequence[Point]]:
    'Assign data the closest centroid'
    d = defaultdict(list)             # type: DefaultDict[Point, List[Point]]
    for p in data:
        centroid = min(centroids, key=partial(dist, p))  # type: Point
        d[centroid].append(p)
    return dict(d)

def compute_centroids(groups: Iterable[Sequence[Point]]) -> List[Centroid]:
    'Compute the centroid of each group'
    return [tuple(map(mean, zip(*pts))) for pts in groups]

def k_means(data: Iterable[Point], k:int=2, iterations:int=10) -> List[Point]:
    'Return k-centroids for the data'
    data = list(data)
    centroids = sample(data, k)
    for i in range(iterations):
        labeled = assign_data(centroids, data)
        centroids = compute_centroids(labeled.values())
    return centroids

Here it is applied to your problem:

data = [69, 8, 30, 45, 89, 61, 80, 45, 9, 18, 19, 11, 1255, 1299,
        1296, 1293, 1287, 1250, 1265, 1291, 1281, 1250, 1286,
        1286, 1251, 1287, 1266, 1288, 1254, 1260, 1260, 1254,
        1267, 1299, 1273, 1250, 1300, 1250, 1279, 1255, 1293,
        1292, 1278, 1277, 1252, 1299, 1278, 1258, 1268, 1274,
        1285, 1258, 1279, 1270, 1278, 1286, 1278, 1253, 1267,
        1300, 1295, 1298, 1285, 1288, 1274, 1272, 1252, 1256,
        1283, 1289, 1251, 1258, 1253, 1257, 1297, 1269, 1292,
        1253, 1273, 1281, 1251, 1280, 1253, 1274, 1275, 1287,
        1296, 1298, 1296, 1291, 1284, 1261, 1267, 1290, 1273,
        1281, 1263, 1270, 1264, 1269, 1278, 1284, 67, 8, 40, 59,
        97, 64, 45, 72, 45, 90, 94, 7, 33, 58, 97, 97, 1252, 1297,
        1265, 1278, 1272, 1252, 1258, 1261, 1287, 1260, 1260,
        1258, 1280, 1263, 1256, 1296, 1269, 1270, 1296, 1282, 696,
        678, 665, 700, 700, 691, 689, 688, 650, 663, 662, 698,
        655, 660, 662, 684, 690, 657, 653, 663, 670, 691, 687,
        675, 694, 670, 676, 659, 661, 664, 664, 689, 683, 675,
        687, 691, 676, 659, 689, 657, 659, 656, 654, 679, 669,
        687, 666, 662, 691, 1260, 1276, 1252, 1295, 1257, 1277,
        1281, 1257, 1295, 1269, 1265, 1290, 1266, 1269, 1286,
        1254, 1260, 1265, 1290, 1294, 1286, 1279, 1254, 1256,
        1276, 1285, 1282, 1251, 1282, 1261, 1253, 56, 74, 85, 94,
        18, 83, 38, 80, 8, 4, 78, 43, 7, 79, 68, 78, 1275, 1250,
        1268, 1297, 1284, 1255, 1294, 1262, 1250, 1252, 680, 693,
        677, 676, 670, 653, 670, 661, 658, 695, 665, 671, 656,
        686, 662, 691, 675, 658, 671, 650, 667, 653, 652, 686,
        667, 682, 694, 654, 689, 682, 667, 658, 651, 652, 692,
        652, 655, 651, 650, 698, 655, 650, 679, 672, 697, 696,
        696, 683, 1277, 1264, 1274, 1260, 1285, 1285, 1283, 1259,
        1260, 1288, 1281, 1284, 1281, 1257, 1285, 1295, 1273,
        1264, 1283, 1284, 1300, 1299, 1257, 1297, 1254, 1257,
        1270, 1257, 1295, 34, 5, 73, 42, 27, 36, 91, 85, 19, 50,
        34, 21, 73, 38, 18, 73]

points = [(x,) for x in data]
centroids = k_means(points, k=3, iterations=100)
clusters = assign_data(centroids, points).values()
for cluster in clusters:
    print(f'{min(cluster)[0]} to {max(cluster)[0]}')

This outputs:

4 to 97
650 to 700
1250 to 1300
Raymond Hettinger
  • 216,523
  • 63
  • 388
  • 485
  • @schwobaseggl _Unsupervised clustering with unknown number of clusters_ deals with 3 dimension data.I want just the range in which my numbers are present. – Ashish Feb 24 '17 at 06:44
  • but in K-Means Clustering number of K's are predefined. But in my data set i won't have any knowledge of how many clusters i will have, – Ashish Feb 24 '17 at 06:48
  • Then your problem is ill-defined. Without constraints, there are trivial solutions (all of the points are in one cluster or every point is its own cluster of size one). – Raymond Hettinger Feb 24 '17 at 06:54
  • Also, the k-means code works with any dimensionality. The example was given with three dimensions but it works fine with only a single dimension. – Raymond Hettinger Feb 24 '17 at 06:56