
First off, this is not a duplicate post; my question is different from the ones I have searched on this site, but please feel free to link to an already answered question if you find one.


Description:

Consider how your own mind picks out 10 (in A) and 2.10 (in B) below as the first element that is not close to the ones before it; that is what I am trying to do programmatically. A hard-coded threshold value is not the best option. Of course we need a threshold here, but the function should derive it from the values provided, so in the case of A the threshold might be around 1.1, and around 0.01 for B. How? Well, "it makes sense", right? We looked at the values and figured it out. That is what I mean by a "dynamic threshold", so to speak, if your answer involves a threshold.

A = [1.1, 1.02, 2.3, 10, 10.01, 10.1, 12, 16, 18, 18]
B = [1.01, 1.02, 1.001, 1.03, 2.10, 2.94, 3.01, 8.99]
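To make the intuition concrete, here is a minimal sketch of the kind of rule I have in mind: derive the threshold from the gaps between consecutive values instead of hard-coding it. The choice of the median as the reference statistic and the factor of 2 are arbitrary assumptions for illustration; picking them robustly is exactly the open question.

def first_far_element(values, factor=2.0):
    # gaps between neighbouring values
    gaps = [abs(b - a) for a, b in zip(values, values[1:])]
    # dynamic threshold: a multiple of the median gap (the factor is a guess)
    median_gap = sorted(gaps)[len(gaps) // 2]
    for i, gap in enumerate(gaps):
        if gap > factor * median_gap:
            return values[i + 1]  # first element that is "not close"
    return None

print(first_far_element(A))  # 10
print(first_far_element(B))  # 2.1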

Python Problem:

I have a 2D list in Python which looks like the one below. Now, if I want to narrow it down to the items which are close to each other, working only from top to bottom (the list is already sorted, as you can notice), we can easily see that the first five entries are much closer to each other than the fifth and the sixth are.

subSetScore = [
    ['F', 'H', 0.12346022214809049],
    ['C', 'E', 0.24674283702138702],
    ['C', 'G', 0.24675055907681284],
    ['E', 'G', 0.3467125665641178],
    ['B', 'D', 0.4720531092083966],
    ['A', 'H', 0.9157739970594413],
    ['A', 'C', 0.9173801845880128],
    ['A', 'G', 0.9174496830868454],
    ['A', 'B', 0.918924595673178],
    ['A', 'F', 0.9403919097569715],
    ['A', 'E', 0.9419672090638398],
    ['A', 'D', 0.9436390340635308], 
    ['B', 'H', 1.3237456293166292],
    ['D', 'H', 1.3237456293166292],
    ['D', 'F', 1.3238460160371646],
    ['B', 'C', 1.3253518168452008],
    ['D', 'E', 1.325421315344033],
    ['D', 'G', 1.325421315344033],
    ['B', 'F', 1.349344243053239],
    ['B', 'E', 1.350919542360107],
    ['B', 'G', 1.350919542360107],
    ['C', 'H', 1.7160260449485403],
    ['E', 'H', 1.7238716532611786],
    ['G', 'H', 1.7238716532611786],
    ['E', 'F', 1.7239720399817142],
    ['C', 'F', 1.7416246586851503],
    ['C', 'D', 1.769389308968704],
    ['F', 'G', 2.1501908892101267]
]

Result:

closest = [
    ['F', 'H', 0.12346022214809049],
    ['C', 'E', 0.24674283702138702],
    ['C', 'G', 0.24675055907681284],
    ['E', 'G', 0.3467125665641178],
    ['B', 'D', 0.4720531092083966]
]

As opposed to other questions I have seen here, where a 1D or 2D list is given along with an arbitrary value, say 0.9536795380033108, and the function has to find that 0.9436390340635308 is the closest value in the list, those solutions mostly use the absolute difference, which does not seem applicable here.

One approach which seemed partially reliable was to calculate the consecutive differences, as follows:

consecutiveDifferences = []

for index, item in enumerate(subSetScore):
    if index == 0:
        continue
    consecutiveDifferences.append([index, subSetScore[index][2] - subSetScore[index - 1][2]])
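Since NumPy is allowed, the same consecutive differences can also be computed in one line with np.diff (just a convenience; the index column is dropped):

import numpy as np

diffs = np.diff([row[2] for row in subSetScore])  # array of consecutive differences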

This gives me the following:

consecutiveDifferences = [
    [1, 0.12328261487329653],
    [2, 7.722055425818386e-06],
    [3, 0.09996200748730497],
    [4, 0.1253405426442788],
    [5, 0.4437208878510447],
    [6, 0.0016061875285715566],
    [7, 6.949849883253201e-05],
    [8, 0.0014749125863325885],
    [9, 0.021467314083793543],
    [10, 0.001575299306868283],
    [11, 0.001671824999690985],
    [12, 0.3801065952530984],
    [13, 0.0],
    [14, 0.00010038672053536146],
    [15, 0.001505800808036195],
    [16, 6.949849883230996e-05],
    [17, 0.0],
    [18, 0.0239229277092059],
    [19, 0.001575299306868061],
    [20, 0.0],
    [21, 0.36510650258843325],
    [22, 0.007845608312638364],
    [23, 0.0],
    [24, 0.00010038672053558351],
    [25, 0.01765261870343604],
    [26, 0.027764650283553793],
    [27, 0.38080158024142263]
]

Now, the index of the first difference that is larger than the difference at index 0 is my cutoff index, as follows:

cutoff = -1
for index, item in enumerate(consecutiveDifferences):
    if index == 0:
        continue
    if item[1] > consecutiveDifferences[0][1]:
        cutoff = index
        break

# the slice must happen after the loop; if no such jump exists,
# cutoff stays -1 and closest ends up empty
closest = subSetScore[:cutoff + 1]

This leaves my list (closest) as follows:

closest = [
    ['F', 'H', 0.12346022214809049],
    ['C', 'E', 0.24674283702138702],
    ['C', 'G', 0.24675055907681284],
    ['E', 'G', 0.3467125665641178],
    ['B', 'D', 0.4720531092083966]
]

But clearly this logic is buggy, and it won't work for the following scenario:

subSetScore = [
    ['A', 'C', 0.143827143333704],
    ['A', 'G', 0.1438310043614169],
    ['D', 'F', 0.15684652878164498],
    ['B', 'H', 0.1568851390587741],
    ['A', 'H', 0.44111469414482873],
    ['A', 'F', 0.44121508086536443],
    ['A', 'E', 0.4441224347331875],
    ['A', 'B', 0.4465394380814708],
    ['A', 'D', 0.4465394380814708],
    ['D', 'H', 0.7595452327118624],
    ['B', 'F', 0.7596456194323981],
    ['B', 'E', 0.7625529733002212],
    ['D', 'E', 0.7625529733002212],
    ['B', 'C', 0.7635645625610041],
    ['B', 'G', 0.763661088253827],
    ['D', 'G', 0.763661088253827],
    ['B', 'D', 0.7649699766485044],
    ['C', 'G', 0.7891593152699012],
    ['G', 'H', 1.0785858136575361],
    ['C', 'H', 1.0909217972002916],
    ['C', 'F', 1.0910221839208274],
    ['C', 'E', 1.0939295377886504],
    ['C', 'D', 1.0963465411369335],
    ['E', 'H', 1.3717343427604187],
    ['E', 'F', 1.3718347294809543],
    ['E', 'G', 1.3758501983023834],
    ['F', 'H', 2.0468234552800326],
    ['F', 'G', 2.050939310821997]
]

As the cutoff would be 2, here is what closest would look like:

closest = [
    ['A', 'C', 0.143827143333704],
    ['A', 'G', 0.1438310043614169],
    ['D', 'F', 0.15684652878164498]
]

But here is the expected result:

closest = [
    ['A', 'C', 0.143827143333704],
    ['A', 'G', 0.1438310043614169],
    ['D', 'F', 0.15684652878164498],
    ['B', 'H', 0.1568851390587741]
]
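One possible refinement, offered only as a sketch: instead of comparing each gap to the first gap, compare it to a multiple of the mean gap, so the reference adapts to the overall scale of the dataset. With factor = 2.0 this happens to reproduce both expected results above (5 rows for the first dataset, 4 rows for the second), but the factor is an arbitrary assumption; an analogous rule over plain values does not reproduce the intuition for list B at the top, which is the core difficulty of the question.

def closest_by_mean_gap(scores, factor=2.0):
    # gaps between consecutive (already sorted) score values
    gaps = [scores[i + 1][2] - scores[i][2] for i in range(len(scores) - 1)]
    # dynamic threshold: a multiple of the mean gap (the factor is a guess)
    threshold = factor * sum(gaps) / len(gaps)
    for i, gap in enumerate(gaps):
        if gap > threshold:
            return scores[:i + 1]
    return scores[:]  # no clear jump found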

More datasets:

subSetScore1 = [
   ['A', 'C', 0.22406316023573888],
   ['A', 'G', 0.22407088229116476],
   ['D', 'F', 0.30378179942424355],
   ['B', 'H', 0.3127393837182006],
   ['A', 'F', 0.4947366470217576],
   ['A', 'H', 0.49582931786451195],
   ['A', 'E', 0.5249800770970015],
   ['A', 'B', 0.6132933639744492],
   ['A', 'D', 0.6164207964219085],
   ['D', 'H', 0.8856811470650012],
   ['B', 'F', 0.8870402288199465],
   ['D', 'E', 0.916716087821392],
   ['B', 'E', 0.929515394689697],
   ['B', 'C', 1.0224773589334915],
   ['D', 'G', 1.0252457158036496],
   ['B', 'G', 1.0815974152736079],
   ['B', 'D', 1.116948985013035],
   ['G', 'H', 1.1663971669323054],
   ['C', 'F', 1.1671269011700458],
   ['C', 'G', 1.202339473911808],
   ['C', 'H', 1.28446739439317],
   ['C', 'E', 1.4222597514115916],
   ['E', 'F', 1.537160075120155],
   ['E', 'H', 1.5428705351075527],
   ['C', 'D', 1.6198555666753154],
   ['E', 'G', 1.964274682777963],
   ['F', 'H', 2.3095586690883034],
   ['F', 'G', 2.6867154391687365]
]

subSetScore2 = [
   ['A', 'H', 0.22812496138972285],
   ['A', 'C', 0.23015200093900193],
   ['A', 'B', 0.2321751794605681],
   ['A', 'G', 0.23302074452969593],
   ['A', 'D', 0.23360762074205865],
   ['A', 'F', 0.24534900601702558],
   ['A', 'E', 0.24730268603975933],
   ['B', 'F', 0.24968107911091342],
   ['B', 'E', 0.2516347591336472],
   ['B', 'H', 0.2535228016852614],
   ['B', 'C', 0.25554984123454044],
   ['C', 'F', 0.2766387746024686],
   ['G', 'H', 0.2767739105724205],
   ['D', 'F', 0.2855654706747223],
   ['D', 'E', 0.28751915069745604],
   ['D', 'G', 0.30469686299220383],
   ['D', 'H', 0.30884360675587186],
   ['E', 'F', 0.31103280946909323],
   ['E', 'H', 0.33070474566638247],
   ['B', 'G', 0.7301435066780336],
   ['B', 'D', 0.7473019138342167],
   ['C', 'E', 0.749630113545103],
   ['C', 'H', 0.7515104340412913],
   ['F', 'H', 0.8092791306818884],
   ['E', 'G', 0.8506307374871814],
   ['C', 'G', 1.2281311390340637],
   ['C', 'D', 1.2454208211324858],
   ['F', 'G', 1.3292051225026873]
]

subSetScore3 = [
   ['A', 'F', 0.06947533266614773],
   ['B', 'F', 0.06947533266614773],
   ['C', 'F', 0.06947533266614773],
   ['D', 'F', 0.06947533266614773],
   ['E', 'F', 0.06947533266614773],
   ['A', 'H', 0.07006993093393628],
   ['B', 'H', 0.07006993093393628],
   ['D', 'H', 0.07006993093393628],
   ['E', 'H', 0.07006993093393628],
   ['G', 'H', 0.07006993093393628],
   ['A', 'E', 0.09015499709650715],
   ['B', 'E', 0.09015499709650715],
   ['D', 'E', 0.09015499709650715],
   ['A', 'C', 0.10039444259115113],
   ['A', 'G', 0.10039444259115113],
   ['B', 'C', 0.10039444259115113],
   ['D', 'G', 0.10039444259115113],
   ['A', 'D', 0.1104369756724366],
   ['A', 'B', 0.11063388808579513],
   ['B', 'G', 2.6511978452376543],
   ['B', 'D', 2.6612403783189396],
   ['C', 'H', 2.670889086573508],
   ['C', 'E', 2.690974152736078],
   ['C', 'G', 5.252017000877225],
   ['E', 'G', 5.252017000877225],
   ['C', 'D', 5.262059533958511],
   ['F', 'H', 5.322704696245228],
   ['F', 'G', 10.504651766188518]
]

How should I fix it, without using any library except NumPy and SciPy?

Please note: I am on Python 2.7, and any library which comes as part of Python (e.g. itertools, operator, math, etc.) could be used.

Update: I can use SciPy. I am not sure what the effect of the number of clusters would be; I think 2 might suffice, but I am not an expert on clustering by any means, so please feel free to advise. I appreciate it!

ablaze
  • How do you determine if two elements are close? Is it only the number? Have you tried k-means clustering? – Ohumeronen Jul 20 '16 at 05:25
  • I suggest you look at clustering algorithms and how to implement them using Numpy. Clustering algorithms basically group items which are similar to each other based on a criterion you specify. There are a lot of clustering algorithms, K-means being one of the more popular ones. (Note: if you're willing to use the scipy library, you'll find k-means already implemented as a function; you just need to call it. It'll save you the trouble of re-inventing the wheel.) – ChaoticTwist Jul 20 '16 at 05:31
  • Even k-means won't be enough, as the problem seems to be very fuzzy. Don't wanna think too hard on this, but taking percentage differences and trying to model their distribution to spot anomalies might help automatically identify the differences – Vivek Kalyanarangan Jul 20 '16 at 05:34
  • Great suggestion! I thought it was only possible via Pycluster – ablaze Jul 20 '16 at 05:35
  • To me the question is whether he knows how many clusters there are or not. That would make the decision of which algorithm to use easier. I assumed he knows the number of clusters in my first comment. – Ohumeronen Jul 20 '16 at 05:36
  • Vivek, another great suggestion. But since this is dynamic data, an X% threshold for the percentage difference might make sense for one problem but not for another. I can't seem to measure them all with the same stick. – ablaze Jul 20 '16 at 05:37
  • The expectation is just to find the close enough numbers. Looking at the k-means clustering algorithm as we speak – ablaze Jul 20 '16 at 05:41
  • UPDATE: I can use SciPy. Would you please shed some light on how the k-means algorithm could solve this problem? I have a very basic understanding of it. Thanks! – ablaze Jul 21 '16 at 00:14

3 Answers


I provide you with some code based on https://codereview.stackexchange.com/questions/80050/k-means-clustering-algorithm-in-python:

# kmeans clustering algorithm
# data = set of data points
# k = number of clusters
# c = initial list of centroids (if provided; note that the code below
#     re-randomizes the centroids and does not actually use c)

import numpy as np

def kmeans(data, k, c):
    centroids = []
    centroids = randomize_centroids(data, centroids, k)  
    old_centroids = [[] for i in range(k)] 
    iterations = 0
    while not (has_converged(centroids, old_centroids, iterations)):
        iterations += 1
        clusters = [[] for i in range(k)]
        # assign data points to clusters
        clusters = euclidean_dist(data, centroids, clusters)
        # recalculate centroids
        index = 0
        for cluster in clusters:
            old_centroids[index] = centroids[index]
            centroids[index] = np.mean(cluster, axis=0).tolist()
            index += 1
    print("The total number of data instances is: " + str(len(data)))
    print("The total number of iterations necessary is: " + str(iterations))
    print("The means of each cluster are: " + str(centroids))
    print("The clusters are as follows:")
    for cluster in clusters:
        print("Cluster with a size of " + str(len(cluster)) + " starts here:")
        print(np.array(cluster).tolist())
        print("Cluster ends here.")
    return

# Calculates euclidean distance between
# a data point and all the available cluster
# centroids.      
def euclidean_dist(data, centroids, clusters):
    for instance in data:  
        # Find which centroid is the closest
        # to the given data point.
        mu_index = min([(i[0], np.linalg.norm(instance-centroids[i[0]])) \
                            for i in enumerate(centroids)], key=lambda t:t[1])[0]
        try:
            clusters[mu_index].append(instance)
        except KeyError:
            clusters[mu_index] = [instance]
    # If any cluster is empty then assign one point
    # from data set randomly so as to not have empty
    # clusters and 0 means.        
    for cluster in clusters:
        if not cluster:
            cluster.append(data[np.random.randint(0, len(data))])
    return clusters

# randomize initial centroids
def randomize_centroids(data, centroids, k):
    for cluster in range(0, k):
        centroids.append(data[np.random.randint(0, len(data))])
    return centroids

# check if clusters have converged    
def has_converged(centroids, old_centroids, iterations):
    MAX_ITERATIONS = 1000
    if iterations > MAX_ITERATIONS:
        return True
    return old_centroids == centroids

###############################################################################
# STARTING COMPUTATION                                                        #
###############################################################################
A = [1.1, 1.02, 2.3, 10, 10.01, 10.1, 12, 16, 18, 18]
B = [1.01, 1.02, 1.001, 1.03, 2.10, 2.94, 3.01, 8.99]
T = [A,B]
k = 3
for t in T:
    cent = np.random.permutation(t)[0:3]
    kmeans(t, k, cent)  # kmeans prints its results and returns nothing
    print 

You will have to determine a value k, which is the number of chunks into which your data will be split. The code splits each of the two arrays A and B which you provided into 3 chunks. You will have to decide: either you set a fixed number of chunks or you set a fixed threshold.

You should also know that k-means is a randomized algorithm which does not always (but quite often) yield the best result. Therefore it might be a good idea to run it multiple times and average over the results.
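Since you mention in your update that SciPy is available: here is a minimal sketch using scipy.cluster.vq.kmeans2 on the score column, assuming k=2 (one "close" cluster, one "far" cluster). With k=2 you get a single split, which may keep more than just the first plateau, and the result can vary between runs because the initial centroids are chosen at random.

import numpy as np
from scipy.cluster.vq import kmeans2

scores = np.array([row[2] for row in subSetScore])
# k=2: split the scores into a "low" and a "high" cluster
centroids, labels = kmeans2(scores, 2, minit='points')
low_cluster = np.argmin(centroids)  # the cluster whose mean score is smaller
closest = [row for row, label in zip(subSetScore, labels) if label == low_cluster]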

Here is my favourite introduction into kmeans clustering by Sebastian Thrun :-)

https://www.youtube.com/watch?v=zaKjh2N8jN4&index=15&list=PL34DBDAC077F8F90D

Does this help you? It should allow you to develop your own version of k-means which suits your needs. Is it OK for you to set a fixed value of k? You have not yet answered this question.

EDIT: Based on "Kmeans without knowing the number of clusters?", I might also come up with a solution with a dynamic value of k if this solution is not yet good enough.

Ohumeronen
  • Wish I could give you multiple upvotes, one for your great help; my current problem is actually about solving an AI problem. And Sebastian Thrun is awesome! He was also behind the "Stanley" car for the DARPA Grand Challenge – ablaze Jul 20 '16 at 17:05
  • Update: it turns out I can actually use SciPy! Updated the OP. – ablaze Jul 21 '16 at 00:00

Special thanks to Ohumeronen for the great help, but I actually ended up trying another heuristic in search of a threshold-less solution. In the comparison below, if the same pair of letters appears in the first two positions at the same index (or one index apart) in both lists, the pair is considered relevant. However, this strategy is not foolproof; I did see one failure, but upon further investigation the culprit turned out to be bad data. So far I am having some success, but more tests will give me a better understanding.

matches = []
for index in range(len(subSetIntersectScore)):
    # same pair at the same index, or at the next index in the union list
    if (subSetIntersectScore[index][0:2] == subSetUnionScore[index][0:2]
            or (index + 1 < len(subSetIntersectScore)
                and subSetIntersectScore[index][0:2] == subSetUnionScore[index + 1][0:2])):
        matches.append(subSetIntersectScore[index][0:2])
    # same pair one index earlier in the union list
    elif index > 0 and subSetIntersectScore[index][0:2] == subSetUnionScore[index - 1][0:2]:
        matches.append(subSetIntersectScore[index][0:2])
    else:
        break
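A toy illustration with hypothetical data (these lists are made up for demonstration and are not from my actual datasets): index 0 matches at the same position, index 1 matches one position later, and index 2 matches one position earlier, so all three pairs are collected.

subSetIntersectScore = [['F', 'H', 0.12], ['C', 'E', 0.25], ['E', 'G', 0.35]]
subSetUnionScore = [['F', 'H', 0.31], ['E', 'G', 0.52], ['C', 'E', 0.77]]

# running the loop above on these lists gives:
# matches == [['F', 'H'], ['C', 'E'], ['E', 'G']]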

Positive Outcomes

Match: [(F, H), (C, E), (C, G), (E, G), (B, D)]

Match: [(A, C), (A, G), (D, F), (B, H)]

(screenshot of a third positive outcome)

Negative Outcomes

(screenshot)

ablaze

Please check out this code:

t1 = [0.12,0.24,0.24,0.34,0.47,0.91,0.91,0.91,0.91,0.94,0.94,0.94,1.32,1.32,1.32,1.32,1.32,1.32,1.34,1.35,1.35,1.71,1.72,1.72,1.72,1.74,1.76,2.15]
t2 = [0.22,0.22,0.30,0.31,0.49,0.49,0.52,0.61,0.61,0.88,0.88,0.91,0.92,1.02,1.02,1.08,1.11,1.16,1.16,1.20,1.28,1.42,1.53,1.54,1.61,1.96,2.30,2.68]
t3 = [0.22,0.23,0.23,0.23,0.23,0.24,0.24,0.24,0.25,0.25,0.25,0.27,0.27,0.28,0.28,0.30,0.30,0.31,0.33,0.73,0.74,0.74,0.75,0.80,0.85,1.22,1.24,1.32]
t4 = [0.06,0.06,0.06,0.06,0.06,0.07,0.07,0.07,0.07,0.07,0.09,0.09,0.09,0.10,0.10,0.10,0.10,0.11,0.11,2.65,2.66,2.67,2.69,5.25,5.25,5.26,5.32,10.50]
ts = [t1,t2,t3,t4]
threshold = 0.5 # 0.3 worked a bit better
for t in ts:
    abs_differences = [abs(t[idx]-t[idx+1]) for idx in range(len(t)-1)]
    # index of the first difference that exceeds the relative threshold;
    # everything after this point would be cut off
    cutOffIndex = [p > max(abs_differences) * threshold for p in abs_differences].index(True)
    # Print each value together with its cut-off flag.
    print zip(t,[p > max(abs_differences) * threshold for p in abs_differences])
    # Print only the flags.
    # print [p > max(abs_differences) * threshold for p in abs_differences]

This allows you to determine the indices where your signal level changes. You can adjust the sensitivity with threshold, which is the fraction of the maximum signal change that a difference must exceed.
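As discussed in the comments below, only the first jump is needed in the OP's case, so the list can then be truncated at that index:

closest = t[:cutOffIndex + 1]  # keep everything up to and including the first big jump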

Ohumeronen
  • As mentioned in the comments on the OP, hard-coding a threshold doesn't make sense. Sorry. – ablaze Jul 20 '16 at 05:50
  • So how would you determine "close enough", as you stated above? I'll gladly improve my suggestion if you can tell me :-) Do you know the number of clusters? – Ohumeronen Jul 20 '16 at 05:52
  • I don't understand why this does not help you. Assuming you don't know the number of clusters, this could help you if you adjust the threshold to a value that suits your needs. The threshold is relative to the maximum percentage change in your signal, so it should be generally applicable to other signals as well... – Ohumeronen Jul 20 '16 at 06:05
  • The reason is, the example of the 2D list above is just one dataset; I'm gonna have to do the same for datasets I've not seen, so how do I know in the first place how to tune the threshold like you suggested? Believe me, I'm already using something very similar where I can. – ablaze Jul 20 '16 at 06:13
  • The number of clusters is the number of chunks in your data/signal. Clustering algorithms can make sense if you know this number. The threshold as I defined it is not a fixed number but adapts to your dataset, since it is relative to the maximum possible signal change in one dataset. Maybe the code is very concise, but it adapts to other datasets as well. I find this topic very interesting. Can you update your post and provide another dataset as well? I would like to have a look. – Ohumeronen Jul 20 '16 at 06:21
  • Sure, I'll add three more datasets. (Please don't wait; I'll make sure to post and add a comment here so that you'll get a notification.) – ablaze Jul 20 '16 at 06:25
  • Added more datasets, as requested. – ablaze Jul 20 '16 at 06:54
  • Also, I plan to run your code in a loop about 100 times, while looping through each dataset in an outer loop. – ablaze Jul 20 '16 at 07:08
  • I am checking my solution on your newly added data. Looping through all the datasets makes sense. – Ohumeronen Jul 20 '16 at 07:47
  • I updated my solution. You can play around with the threshold and find out what suits best. I also printed the values as tuples together with a boolean to indicate for each value if the next value will be significantly different. I found a value 0.15 <= threshold <= 0.5 to be adequate. Does this help you? If not, how would the desired output for the 4 datasets look like? – Ohumeronen Jul 20 '16 at 08:09
  • In case you give me the desired output, can you give it based on the code I provided and in the form of an array of booleans? That would really help me to understand. But as I said, just if necessary. – Ohumeronen Jul 20 '16 at 08:31
  • Please see the revised question, hope it helps. I do appreciate your help so far ! – ablaze Jul 20 '16 at 13:30
  • Approved the edit but with .index(True) you only have one single cut. This gives you 2 chunks/clusters of data. Maybe what you wanted was a little bit simpler than I thought. – Ohumeronen Jul 21 '16 at 13:56
  • I appreciate it! Yes, only the first occurrence, and then I update the list as x=x[0:cutOffIndex+1] – ablaze Jul 21 '16 at 14:09