First off, this is not a duplicate post: my question differs from the ones I found on this site, but please feel free to link to an already answered question if you find one.
Description:
Think about how your own mind decides that 10 and 2.10 are the first elements that are not close to the ones before them in A and B below; that is what I am trying to do programmatically. A hard-coded threshold value is not the best option. Of course we need a threshold here, but the function should derive it from the values provided, so in the case of A the threshold might be around 1.1, and around 0.01 for B. How? Well, "it makes sense", right? We looked at the values and figured it out. That is what I mean by a "dynamic threshold", if your answer uses a threshold at all.
A = [1.1, 1.02, 2.3, 10, 10.01, 10.1, 12, 16, 18, 18]
B = [1.01, 1.02, 1.001, 1.03, 2.10, 2.94, 3.01, 8.99]
Python Problem:
I have a 2D list in Python that looks like the one below. If I want to narrow it down to the items that are close to each other, going only from top to bottom (the list is already sorted by score, as you can see), it is easy to see that the first five items are much closer to each other than the 5th and 6th are.
subSetScore = [
['F', 'H', 0.12346022214809049],
['C', 'E', 0.24674283702138702],
['C', 'G', 0.24675055907681284],
['E', 'G', 0.3467125665641178],
['B', 'D', 0.4720531092083966],
['A', 'H', 0.9157739970594413],
['A', 'C', 0.9173801845880128],
['A', 'G', 0.9174496830868454],
['A', 'B', 0.918924595673178],
['A', 'F', 0.9403919097569715],
['A', 'E', 0.9419672090638398],
['A', 'D', 0.9436390340635308],
['B', 'H', 1.3237456293166292],
['D', 'H', 1.3237456293166292],
['D', 'F', 1.3238460160371646],
['B', 'C', 1.3253518168452008],
['D', 'E', 1.325421315344033],
['D', 'G', 1.325421315344033],
['B', 'F', 1.349344243053239],
['B', 'E', 1.350919542360107],
['B', 'G', 1.350919542360107],
['C', 'H', 1.7160260449485403],
['E', 'H', 1.7238716532611786],
['G', 'H', 1.7238716532611786],
['E', 'F', 1.7239720399817142],
['C', 'F', 1.7416246586851503],
['C', 'D', 1.769389308968704],
['F', 'G', 2.1501908892101267]
]
Result:
closest = [
['F', 'H', 0.12346022214809049],
['C', 'E', 0.24674283702138702],
['C', 'G', 0.24675055907681284],
['E', 'G', 0.3467125665641178],
['B', 'D', 0.4720531092083966]
]
This is different from other questions I have seen here, where a 1D or 2D list is given together with an arbitrary value, say 0.9536795380033108, and the function has to find that 0.9436390340635308 is the closest value in the list. Most of those solutions use the absolute difference, which does not seem to be applicable here.
One approach that seemed partially reliable was to calculate the differences between consecutive scores, as follows:
consecutiveDifferences = []
for index, item in enumerate(subSetScore):
    if index == 0:
        continue
    consecutiveDifferences.append([index, subSetScore[index][2] - subSetScore[index - 1][2]])
This gives me the following:
consecutiveDifferences = [
[1, 0.12328261487329653],
[2, 7.722055425818386e-06],
[3, 0.09996200748730497],
[4, 0.1253405426442788],
[5, 0.4437208878510447],
[6, 0.0016061875285715566],
[7, 6.949849883253201e-05],
[8, 0.0014749125863325885],
[9, 0.021467314083793543],
[10, 0.001575299306868283],
[11, 0.001671824999690985],
[12, 0.3801065952530984],
[13, 0.0],
[14, 0.00010038672053536146],
[15, 0.001505800808036195],
[16, 6.949849883230996e-05],
[17, 0.0],
[18, 0.0239229277092059],
[19, 0.001575299306868061],
[20, 0.0],
[21, 0.36510650258843325],
[22, 0.007845608312638364],
[23, 0.0],
[24, 0.00010038672053558351],
[25, 0.01765261870343604],
[26, 0.027764650283553793],
[27, 0.38080158024142263]
]
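For reference, the same list of gaps can be built more compactly by pairing each score with its successor. This is only a sketch: `sample` is a shortened copy of the first five rows of subSetScore, and the names `scores` and `consecutive_differences` are made up for the illustration.

```python
# Shortened sample for illustration: the first five rows of subSetScore.
sample = [
    ['F', 'H', 0.12346022214809049],
    ['C', 'E', 0.24674283702138702],
    ['C', 'G', 0.24675055907681284],
    ['E', 'G', 0.3467125665641178],
    ['B', 'D', 0.4720531092083966],
]

# Keep only the numeric score from each row.
scores = [row[2] for row in sample]

# zip(scores, scores[1:]) yields (previous, current) pairs;
# index i labels the gap between row i - 1 and row i, as in the loop above.
consecutive_differences = [
    [i, current - previous]
    for i, (previous, current) in enumerate(zip(scores, scores[1:]), 1)
]
```

This runs on Python 2.7 as well as Python 3 and produces the same `[index, difference]` pairs as the explicit loop.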
Then the index of the first difference that is larger than the difference at index 0 is my cutoff index, as follows:
cutoff = -1
for index, item in enumerate(consecutiveDifferences):
    if index == 0:
        continue
    if consecutiveDifferences[index][1] > consecutiveDifferences[0][1]:
        cutoff = index
        break
cutoff = cutoff+1
closest = subSetScore[:cutoff+1]
Which would leave my list (closest) as follows:
closest = [
['F', 'H', 0.12346022214809049],
['C', 'E', 0.24674283702138702],
['C', 'G', 0.24675055907681284],
['E', 'G', 0.3467125665641178],
['B', 'D', 0.4720531092083966]
]
But clearly this logic is buggy, and it won't work for the following scenario:
subSetScore = [
['A', 'C', 0.143827143333704],
['A', 'G', 0.1438310043614169],
['D', 'F', 0.15684652878164498],
['B', 'H', 0.1568851390587741],
['A', 'H', 0.44111469414482873],
['A', 'F', 0.44121508086536443],
['A', 'E', 0.4441224347331875],
['A', 'B', 0.4465394380814708],
['A', 'D', 0.4465394380814708],
['D', 'H', 0.7595452327118624],
['B', 'F', 0.7596456194323981],
['B', 'E', 0.7625529733002212],
['D', 'E', 0.7625529733002212],
['B', 'C', 0.7635645625610041],
['B', 'G', 0.763661088253827],
['D', 'G', 0.763661088253827],
['B', 'D', 0.7649699766485044],
['C', 'G', 0.7891593152699012],
['G', 'H', 1.0785858136575361],
['C', 'H', 1.0909217972002916],
['C', 'F', 1.0910221839208274],
['C', 'E', 1.0939295377886504],
['C', 'D', 1.0963465411369335],
['E', 'H', 1.3717343427604187],
['E', 'F', 1.3718347294809543],
['E', 'G', 1.3758501983023834],
['F', 'H', 2.0468234552800326],
['F', 'G', 2.050939310821997]
]
As the cutoff would be 2, here is what closest would look like:
closest = [
['A', 'C', 0.143827143333704],
['A', 'G', 0.1438310043614169],
['D', 'F', 0.15684652878164498]
]
But here is the expected result:
closest = [
['A', 'C', 0.143827143333704],
['A', 'G', 0.1438310043614169],
['D', 'F', 0.15684652878164498],
['B', 'H', 0.1568851390587741]
]
More datasets:
subSetScore1 = [
['A', 'C', 0.22406316023573888],
['A', 'G', 0.22407088229116476],
['D', 'F', 0.30378179942424355],
['B', 'H', 0.3127393837182006],
['A', 'F', 0.4947366470217576],
['A', 'H', 0.49582931786451195],
['A', 'E', 0.5249800770970015],
['A', 'B', 0.6132933639744492],
['A', 'D', 0.6164207964219085],
['D', 'H', 0.8856811470650012],
['B', 'F', 0.8870402288199465],
['D', 'E', 0.916716087821392],
['B', 'E', 0.929515394689697],
['B', 'C', 1.0224773589334915],
['D', 'G', 1.0252457158036496],
['B', 'G', 1.0815974152736079],
['B', 'D', 1.116948985013035],
['G', 'H', 1.1663971669323054],
['C', 'F', 1.1671269011700458],
['C', 'G', 1.202339473911808],
['C', 'H', 1.28446739439317],
['C', 'E', 1.4222597514115916],
['E', 'F', 1.537160075120155],
['E', 'H', 1.5428705351075527],
['C', 'D', 1.6198555666753154],
['E', 'G', 1.964274682777963],
['F', 'H', 2.3095586690883034],
['F', 'G', 2.6867154391687365]
]
subSetScore2 = [
['A', 'H', 0.22812496138972285],
['A', 'C', 0.23015200093900193],
['A', 'B', 0.2321751794605681],
['A', 'G', 0.23302074452969593],
['A', 'D', 0.23360762074205865],
['A', 'F', 0.24534900601702558],
['A', 'E', 0.24730268603975933],
['B', 'F', 0.24968107911091342],
['B', 'E', 0.2516347591336472],
['B', 'H', 0.2535228016852614],
['B', 'C', 0.25554984123454044],
['C', 'F', 0.2766387746024686],
['G', 'H', 0.2767739105724205],
['D', 'F', 0.2855654706747223],
['D', 'E', 0.28751915069745604],
['D', 'G', 0.30469686299220383],
['D', 'H', 0.30884360675587186],
['E', 'F', 0.31103280946909323],
['E', 'H', 0.33070474566638247],
['B', 'G', 0.7301435066780336],
['B', 'D', 0.7473019138342167],
['C', 'E', 0.749630113545103],
['C', 'H', 0.7515104340412913],
['F', 'H', 0.8092791306818884],
['E', 'G', 0.8506307374871814],
['C', 'G', 1.2281311390340637],
['C', 'D', 1.2454208211324858],
['F', 'G', 1.3292051225026873]
]
subSetScore3 = [
['A', 'F', 0.06947533266614773],
['B', 'F', 0.06947533266614773],
['C', 'F', 0.06947533266614773],
['D', 'F', 0.06947533266614773],
['E', 'F', 0.06947533266614773],
['A', 'H', 0.07006993093393628],
['B', 'H', 0.07006993093393628],
['D', 'H', 0.07006993093393628],
['E', 'H', 0.07006993093393628],
['G', 'H', 0.07006993093393628],
['A', 'E', 0.09015499709650715],
['B', 'E', 0.09015499709650715],
['D', 'E', 0.09015499709650715],
['A', 'C', 0.10039444259115113],
['A', 'G', 0.10039444259115113],
['B', 'C', 0.10039444259115113],
['D', 'G', 0.10039444259115113],
['A', 'D', 0.1104369756724366],
['A', 'B', 0.11063388808579513],
['B', 'G', 2.6511978452376543],
['B', 'D', 2.6612403783189396],
['C', 'H', 2.670889086573508],
['C', 'E', 2.690974152736078],
['C', 'G', 5.252017000877225],
['E', 'G', 5.252017000877225],
['C', 'D', 5.262059533958511],
['F', 'H', 5.322704696245228],
['F', 'G', 10.504651766188518]
]
How should I fix it without using any library other than NumPy and SciPy?
Please note: I am on Python 2.7, and any module that ships with Python (e.g. itertools, operator, math, etc.) can be used.
Update: I can use SciPy. I am not sure what the effect of the number of clusters would be; I think 2 might suffice, but I am not an expert on clustering by any means, so please feel free to advise. I appreciate it!
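To make the "dynamic threshold" idea concrete, here is one possible sketch, not an established solution: treat a gap as large when it exceeds the mean plus `k` standard deviations of all consecutive gaps in the list. The rule itself, the function name `first_large_gap_cutoff`, and the default `k=1.0` are assumptions made for this illustration. The scores are the two datasets with known expected results, rounded to four decimals to keep the block short.

```python
import math


def first_large_gap_cutoff(values, k=1.0):
    """Count the leading items that come before the first 'large' gap.

    Assumption of this sketch: a gap is 'large' when it exceeds
    mean + k * std of all consecutive gaps in the list.
    """
    gaps = [b - a for a, b in zip(values, values[1:])]
    if not gaps:
        return len(values)
    n = float(len(gaps))
    mean = sum(gaps) / n
    std = math.sqrt(sum((g - mean) ** 2 for g in gaps) / n)
    threshold = mean + k * std  # derived from the data, not hard-coded
    for i, gap in enumerate(gaps):
        if gap > threshold:
            return i + 1  # items 0..i sit before the large gap
    return len(values)  # no large gap found: keep everything


# Scores of the first subSetScore (rounded to four decimals).
scores_1 = [
    0.1235, 0.2467, 0.2468, 0.3467, 0.4721, 0.9158, 0.9174,
    0.9174, 0.9189, 0.9404, 0.9420, 0.9436, 1.3237, 1.3237,
    1.3238, 1.3254, 1.3254, 1.3254, 1.3493, 1.3509, 1.3509,
    1.7160, 1.7239, 1.7239, 1.7240, 1.7416, 1.7694, 2.1502,
]

# Scores of the failing scenario (rounded to four decimals).
scores_2 = [
    0.1438, 0.1438, 0.1568, 0.1569, 0.4411, 0.4412, 0.4441,
    0.4465, 0.4465, 0.7595, 0.7596, 0.7626, 0.7626, 0.7636,
    0.7637, 0.7637, 0.7650, 0.7892, 1.0786, 1.0909, 1.0910,
    1.0939, 1.0963, 1.3717, 1.3718, 1.3759, 2.0468, 2.0509,
]

cutoff_1 = first_large_gap_cutoff(scores_1)  # 5, matching the first expected result
cutoff_2 = first_large_gap_cutoff(scores_2)  # 4, matching the second expected result
```

With the full rows, the selection would then be `closest = subSetScore[:cutoff]`. This is a heuristic only: it reproduces the two expected results shown above, but it does not give the intuitive cut for B, where the later jump to 8.99 dominates the gap statistics, so `k` or the rule itself may need adjusting for other data.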