
I seem to be running into a bug with Python's built-in sorted function. I can't seem to find it documented or described anywhere, but I'm also getting results that seem to be inherently impossible based on my understanding of Python so I'm hoping that someone can help, or at least point me in the right direction.

To start, I've got a long list of floats (1.4m to be exact). They are all very small, and some of them are very near each other. In fact when cast to a set the length is only ~240k.

My goal seems simple: I need to find a cutoff such that only n values are above it. Here is my current code:

sorted_values = sorted(all_values)
top_250 = [item for item in sorted_values if item < sorted_values[250]]
len(top_250)  # 5828

Unfortunately the data is proprietary (and quite large) so I cannot share it, but hopefully I've given enough detail above. The list is a list of all floats, all between 0 and 1, with many duplicates. I cannot for the life of me figure out what's going on here.
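
Claims like these are cheap to check before suspecting `sorted`. The sketch below is one way to do it; `summarize_floats` and the sample list are hypothetical names and values for illustration, not part of the original code or data.

```python
import math

def summarize_floats(values):
    """Report facts that a description of the data might take for granted."""
    finite = [v for v in values if math.isfinite(v)]
    return {
        "types": {type(v).__name__ for v in values},
        "non_finite": len(values) - len(finite),  # NaN or +/-inf entries
        "outside_0_1": sum(1 for v in finite if not 0.0 <= v <= 1.0),
    }

# Hypothetical sample standing in for all_values.
sample = [0.1, 0.25, float("nan"), 0.5, 43.95]
print(summarize_floats(sample))
```

A non-zero `non_finite` or `outside_0_1` count means the description of the data does not hold. The counts are computed without `min` or `max`, since those rely on comparisons that a NaN would also disturb.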

My best guess is that the sorted comparator operates differently from a traditional < in Python. Perhaps the < has an implicit epsilon associated with it that I'm not aware of, but I can't find anything similar in the Python docs, so I'm really at a dead end.

If anyone has even the vaguest sense of what this might be I would be tremendously grateful.

Update: If I replace all_values above with list(set(all_values)), this problem disappears and sorted performs as expected. This does not, however, solve my problem, nor does it explain why sorted is acting differently than documented. Based on a very large number of attempts (and failures) at reproducing this locally, this seems to happen specifically with floats at large data sizes.

Update 2: Data below, and in this gist: https://gist.github.com/Slater-Victoroff/bc03fb39e07a30caf57460330d02d3f7

[4.8502882289326867e-14, 8.820403952763587e-14, 1.0250590153123516e-13, 1.6954166246183968e-13, 1.908753789966549e-13, 2.004836996365545e-13, 3.5909784960471013e-13, 5.166693869401628e-13, 7.550572170701337e-13, 9.448695131203987e-13, 1.0156517689399418e-12, 1.1252632977953388e-12, 1.2755835948385307e-12, 1.718926371953829e-12, 5.561410412067812e-12, 5.612637550150243e-12, 8.453230126630751e-12, 8.725636132121303e-12, 1.613513316811227e-11, 1.9752204589154072e-11, 3.992415416880054e-11, 5.064793173473998e-11, 6.907684718699313e-11, 7.573645788325202e-11, 1.9359271945838294e-10, 2.5869603101103545e-10, 3.2604244542604767e-10, 6.107383625965949e-10, 3.054349088416707e-09, 3.948608349010387e-09, 4.447405216487758e-09, 1.0212339772102215e-06, 1.3008598108779382e-06, 1.3008598108779382e-06, 1.3008598108779382e-06, 1.3008598108779382e-06, 1.3008598108779382e-06, 1.3008598108779382e-06, 1.3008598108779382e-06, 1.3008598108779382e-06, 1.3008598108779382e-06, 1.3008598108779382e-06, 1.3008598108779382e-06, 1.3008598108779382e-06, 1.3008598108779382e-06, 1.3008598108779382e-06, 1.3008598108779382e-06, 1.7445705189856901e-06, 2.288475481497035e-06, 2.4871339408837677e-06, 4.784661357565223e-06, 7.735179302262651e-06, 7.735179302262651e-06, 7.735179302262651e-06, 1.143879077437328e-05, 1.7526502766236366e-05, 2.1197709067643072e-05, 2.6430911436535815e-05, 2.657094056627476e-05, 2.887131448340355e-05, 3.0230001662261698e-05, 3.0230001662261698e-05, 3.0230001662261698e-05, 3.0230001662261698e-05, 3.0230001662261698e-05, 3.4863008760479634e-05, 3.492397816990208e-05, 7.053323604609976e-05, 8.177663548884549e-05, 0.00010292414518908584, 0.00011010458218851394, 0.0001394823618647422, 0.00019911692003617067, 0.00021400555532215588, 0.00021400555532215588, 0.00021400555532215588, 0.00021400555532215588, 0.00021400555532215588, 0.00021400555532215588, 0.00030495456056691696, 0.00030495456056691696, 0.00030495456056691696, 0.00030495456056691696, 0.00030495456056691696, 
0.00030495456056691696, 0.00030495456056691696, 0.00030495456056691696, 0.00030495456056691696, 0.00030495456056691696, 0.00030495456056691696, 0.00030495456056691696, 0.00030495456056691696, 0.00036828214026394526, 0.00040585244638543746, 0.00044624278933638414, 0.000624263866548228, 0.0006433262237973916, 0.0007951156865591151, 0.0011355323237735688, 0.0011355323237735688, 0.0011355323237735688, 0.0011355323237735688, 0.0011355323237735688, 0.0011355323237735688, 0.0011355323237735688, 0.0014101558138724423, 0.0014101558138724423, 0.0016218187650366893, 0.0018435395778915382, 0.00222143411631358, 0.002322745213235068, 0.002322745213235068, 0.0028678013067590753, 0.0028678013067590753, 0.00311777897086883, 0.0033861158375087397, 0.0033861158375087397, 0.0033861158375087397, 0.0033861158375087397, 0.0036513779740440966, 0.0036513779740440966, 0.0036513779740440966, 0.003677317310588113, 0.003677317310588113, 0.003677317310588113, 0.003677317310588113, 0.003677317310588113, 0.003677317310588113, 0.003677317310588113, 0.00475142897834944, 0.00475142897834944, 0.005193932513124365, 0.005337293972714926, 0.005337293972714926, 0.005337293972714926, 0.005377687077724396, 0.005377687077724396, 0.005377687077724396, 0.005377687077724396, 0.005377687077724396, 0.005377687077724396, 0.007409574659909205, 0.009752895879302663, 0.009752895879302663, 0.009752895879302663, 0.01039795547744385, 0.011221522124806933, 0.011221522124806933, 0.011221522124806933, 0.011221522124806933, 0.011221522124806933, 0.011221522124806933, 0.011221522124806933, 0.011221522124806933, 0.011352761081333766, 0.01220266897077021, 0.01220266897077021, 0.01250688334305161, 0.01250688334305161, 0.01250688334305161, 0.01250688334305161, 0.01250688334305161, 0.01250688334305161, 0.01250688334305161, 0.01250688334305161, 0.01250688334305161, 0.01329267875817111, 0.014505355414192591, 0.014505355414192591, 0.014505355414192591, 0.01524607053576118, 0.01524607053576118, 0.01524607053576118, 
0.01524607053576118, 0.01524607053576118, 0.01524607053576118, 0.015404915329058346, 0.015404915329058346, 0.015404915329058346, 0.017643610569380134, 0.017643610569380134, 0.017643610569380134, 0.017643610569380134, 0.017643610569380134, 0.017643610569380134, 0.017643610569380134, 0.017643610569380134, 0.017643610569380134, 0.017643610569380134, 0.017643610569380134, 0.017643610569380134, 0.017643610569380134, 0.018660404569681696, 0.020872950531874518, 0.020872950531874518, 0.020872950531874518, 0.020872950531874518, 0.020872950531874518, 0.020872950531874518, 0.020872950531874518, 0.020872950531874518, 0.020872950531874518, 0.020872950531874518, 0.020872950531874518, 0.020872950531874518, 0.020872950531874518, 0.020872950531874518, 0.020872950531874518, 0.020872950531874518, 0.020872950531874518, 0.020872950531874518, 0.020872950531874518, 0.020872950531874518, 0.020872950531874518, 0.02318941202441517, 0.02318941202441517, 0.024206613952731663, 0.030469418869701087, 0.031649468600811484, 0.031649468600811484, 0.031649468600811484, 0.031649468600811484, 0.031649468600811484, 0.031649468600811484, 0.031649468600811484, 0.03421567685770522, 0.044982539971228225, 0.048298592091885494, 0.048298592091885494, 0.048298592091885494, 0.04891569108043106, 0.04891569108043106, 0.049897804257083, 0.049897804257083, 0.05467242947627652, 0.05467242947627652, 0.05481710086367935, 0.055089155307102226, 0.055089155307102226, 0.05509817970492246, 0.05677377998146615, 0.06442952433704384, 0.06442952433704384, 0.06442952433704384, 0.06649111613211628, 0.08749780016064151, 0.08749780016064151, nan, 9.281139595433629e-14, 2.269349834260423e-13, 2.7902576028316086e-13, 4.218835214235241e-13, 4.5077086681763955e-13, 4.82464670143429e-13, 4.842603006971216e-13, 5.193479028606108e-13, 6.068084596209685e-13, 6.610030519110495e-13, 7.212479024605606e-13, 8.568857509989623e-13, 1.6272513763945853e-12, 1.9674709158152676e-12, 2.1372581686426213e-12, 2.254528174258105e-12, 
2.2876757671708803e-12, 2.4025583918652713e-12, 2.4241116067787184e-12, 3.0036146073251357e-12, 3.114057776360651e-12, 3.190714414399256e-12, 3.5097864963972432e-12, 3.672330803019696e-12, 3.7398481401433984e-12, 3.990818765183174e-12, 4.365123487983834e-12, 4.8888554308448594e-12, 5.387012900499014e-12, 7.96843236064306e-12, 8.077400370649127e-12, 1.1251193515434908e-11, 1.3675744745562445e-11, 1.3920614660091849e-11, 1.4087308217430403e-11, 1.4273596897235296e-11, 1.461920440928664e-11, 1.5634002646418687e-11, 2.3968977128922425e-11, 2.433923659511027e-11, 2.660223144487514e-11, 2.6664588260367057e-11, 2.676534530323327e-11, 2.8357052592197995e-11, 3.5482465948075545e-11, 3.912823701897628e-11, 4.5792515620344005e-11, 4.83302598018717e-11, 5.7496911875158286e-11, 7.393414940211248e-11, 7.537810324630735e-11, 1.0467186667268884e-10, 1.1561730928248412e-10, 1.2311221790850263e-10, 1.4904172713013177e-10, 1.9074109748783146e-10, 2.0111746088624902e-10, 2.2283480985401337e-10, 2.357187317491253e-10, 2.843497025582179e-10, 2.9354396217842974e-10, 3.776793074685755e-10, 3.9930283126020015e-10, 4.873977459556436e-10, 6.136966626402195e-10, 7.918346129110473e-10, 8.089460836160799e-10, 8.434394278175816e-10, 1.232965109215602e-09, 1.5142285361893607e-09, 1.6328557314948404e-09, 2.035771709232225e-09, 2.2619806140213693e-09, 2.6289372753845552e-09, 3.0025433124197725e-09, 3.035021500823481e-09, 3.206258341498779e-09, 3.3000409154642987e-09, 8.413437663248949e-09, 9.618715278570361e-09, 1.1016869472694786e-08, 1.900790439141643e-08, 2.311826318521997e-08, 3.256531210029739e-08, 7.062069115372226e-08, 1.0379292719811166e-07, 3.5715917986665096e-07, 3.6750233373760955e-07, 6.843164520980078e-07, 1.3008598108779382e-06, 1.3008598108779382e-06, 1.3008598108779382e-06, 1.3008598108779382e-06, 1.3008598108779382e-06, 1.3008598108779382e-06, 1.3008598108779382e-06, 1.3008598108779382e-06, 1.3008598108779382e-06, 1.3008598108779382e-06, 1.3008598108779382e-06, 
1.3008598108779382e-06, 1.3008598108779382e-06, 1.3008598108779382e-06, 1.3008598108779382e-06, 1.3008598108779382e-06, 1.3008598108779382e-06, 1.3008598108779382e-06, 1.3008598108779382e-06, 1.3008598108779382e-06, 1.3008598108779382e-06, 1.3008598108779382e-06, 1.3008598108779382e-06, 1.3008598108779382e-06, 1.3008598108779382e-06, 1.3460252150945272e-06, 1.8929692383974172e-06, 7.735179302262651e-06, 7.735179302262651e-06, 7.735179302262651e-06, 7.735179302262651e-06, 7.735179302262651e-06, 7.735179302262651e-06, 7.735179302262651e-06, 7.735179302262651e-06, 7.735179302262651e-06, 8.411728545722206e-06, 8.778941800484344e-06, 8.942313844705377e-06, 1.4014148371009287e-05, 2.2926951485335274e-05, 2.2926951485335274e-05, 2.431593662720157e-05, 2.4932490260874857e-05, 2.7271518164709887e-05, 2.7313174993637146e-05, 2.9232793694637693e-05, 3.0230001662261698e-05, 3.0230001662261698e-05, 3.0230001662261698e-05, 3.0230001662261698e-05, 3.0230001662261698e-05, 3.0230001662261698e-05, 3.0230001662261698e-05, 3.0230001662261698e-05, 3.0230001662261698e-05, 3.4278826533092836e-05, 3.4360528895444655e-05, 3.5152107842660994e-05, 3.5152107842660994e-05, 3.8734026594771645e-05, 4.565502139540429e-05, 4.597203836887115e-05, 4.822546291257643e-05, 4.8425264344652075e-05, 4.855200308535534e-05, 5.510805469377382e-05, 5.7112188449031483e-05, 6.092159391178477e-05, 6.879571179956485e-05, 8.313809937273007e-05, 0.00011371822284591567, 0.00012011351199907829, 0.0001335844793605124, 0.00013784572395372847, 0.0001394823618647422, 0.0001394823618647422, 0.0001394823618647422, 0.0001394823618647422, 0.0001394823618647422, 0.0001394823618647422, 0.0001394823618647422, 0.0001394823618647422, 0.00015987871167652716, 0.00021400555532215588, 0.00021400555532215588, 0.00021400555532215588, 0.00021400555532215588, 0.00021400555532215588, 0.00021400555532215588, 0.00021400555532215588, 0.00021400555532215588, 0.00021400555532215588, 0.00021400555532215588, 0.00021400555532215588, 
0.00021400555532215588, 0.00021400555532215588, 0.00021400555532215588, 0.00021400555532215588, 0.00021400555532215588, 0.00021400555532215588, 0.00021400555532215588, 0.00021400555532215588, 0.00021400555532215588, 0.00021400555532215588, 0.00021400555532215588, 0.00021400555532215588, 0.00021400555532215588, 0.00021400555532215588, 0.00024385105098149426, 0.0002729346730477522, 0.00030495456056691696, 0.00030495456056691696, 0.00030495456056691696, 0.00030495456056691696, 0.00030495456056691696, 0.00030495456056691696, 0.00030495456056691696, 0.00030495456056691696, 0.00030495456056691696, 0.00030495456056691696, 0.00030495456056691696, 0.00030495456056691696, 0.00030495456056691696, 0.00030495456056691696, 0.00030495456056691696, 0.00030495456056691696, 0.00030495456056691696, 0.00030495456056691696, 0.00030495456056691696, 0.00030495456056691696, 0.00030495456056691696, 0.00030495456056691696, 0.00030495456056691696, 0.00030495456056691696, 0.00030495456056691696, 0.0003550239242961073, 0.0003765027072067583, 0.0004240672096162282, 0.0004311923920911356, 0.0004392093202240639, 0.00046548915927821635, 0.00046667500410430354, 0.0004947685893786727, 0.0006369455430736078, 0.000838680562258869, 0.0011355323237735688, 0.0011355323237735688, 0.0011355323237735688, 0.0011355323237735688, 0.0011355323237735688, 0.0011355323237735688, 0.0011355323237735688, 0.0011355323237735688, 0.0011355323237735688, 0.0011355323237735688, 0.0011355323237735688, 0.0011355323237735688, 0.0011355323237735688, 0.0011355323237735688, 0.0011355323237735688, 0.0011355323237735688, 0.0011355323237735688, 0.0011355323237735688, 0.0011355323237735688, 0.0011355323237735688, 0.0011355323237735688, 0.0011355323237735688, 0.0011355323237735688, 0.0011355323237735688, 0.0011355323237735688, 0.0011355323237735688, 0.0011355323237735688, 0.0011355323237735688, 0.0011355323237735688, 0.0011355323237735688, 0.0011355323237735688, 0.0011355323237735688, 0.0014101558138724423, 0.0014101558138724423, 
0.0014101558138724423, 0.0014101558138724423, 0.0015552826431019483, 0.0016218187650366893, 0.0016218187650366893, 0.0016218187650366893, 0.0016218187650366893, 0.0018435395778915382, 0.0018435395778915382, 0.0018435395778915382, 0.0019042110442508626, 0.00222143411631358, 0.00222143411631358, 0.00222143411631358, 0.00222143411631358, 0.00222143411631358, 0.00222143411631358, 0.00222143411631358, 0.00222143411631358, 0.00222143411631358, 0.0022432887035898808, 0.002322745213235068, 0.002322745213235068, 0.002322745213235068, 0.002322745213235068, 0.002322745213235068, 0.002322745213235068, 0.002322745213235068, 0.002322745213235068, 0.002322745213235068, 0.002322745213235068, 0.0027235004253453373, 0.0028678013067590753, 0.0028678013067590753, 0.0028678013067590753, 0.0028678013067590753, 0.00311777897086883, 0.0033861158375087397, 0.0033861158375087397, 0.0033861158375087397, 0.0033861158375087397, 0.0033861158375087397, 0.0033861158375087397, 0.0033861158375087397, 0.0033861158375087397, 0.0033861158375087397, 0.0033861158375087397, 0.0033861158375087397, 0.003464625013107984, 0.003464625013107984, 0.003464625013107984, 0.0036048379414816956, 0.0036048379414816956, 0.0036513779740440966, 0.0036513779740440966, 0.0036513779740440966, 0.0036513779740440966, 0.003673258086633926, 0.003677317310588113, 0.003677317310588113, 0.003677317310588113, 0.003677317310588113, 0.003677317310588113, 0.003677317310588113, 0.003677317310588113, 0.003677317310588113, 0.003677317310588113, 0.003677317310588113, 0.004419966866807584, 0.00475142897834944, 0.00475142897834944, 0.00475142897834944, 0.00475142897834944, 0.00475142897834944, 0.00475142897834944, 0.00475142897834944, 0.00475142897834944, 0.00475142897834944, 0.00475142897834944, 0.004779170485667904, 0.005337293972714926, 0.005337293972714926, 0.005337293972714926, 0.005337293972714926, 0.005337293972714926, 0.005337293972714926, 0.005337293972714926, 0.005337293972714926, 0.005337293972714926, 0.005337293972714926, 
0.005337293972714926, 0.005377687077724396, 0.005377687077724396, 0.005377687077724396, 0.005377687077724396, 0.005377687077724396, 0.005377687077724396, 0.005377687077724396, 0.005377687077724396, 0.005377687077724396, 0.005377687077724396, 0.005377687077724396, 0.005377687077724396, 0.008735137230944151, 0.009242277068721074, 0.009242277068721074, 0.009242277068721074, 0.009242277068721074, 0.009242277068721074, 0.009752895879302663, 0.009752895879302663, 0.009752895879302663, 0.009752895879302663, 0.009752895879302663, 0.009752895879302663, 0.01039795547744385, 0.010628779208547554, 0.011221522124806933, 0.011221522124806933, 0.011221522124806933, 0.011221522124806933, 0.011221522124806933, 0.011221522124806933, 0.011221522124806933, 0.011221522124806933, 0.011221522124806933, 0.011221522124806933, 0.011221522124806933, 0.011221522124806933, 0.011221522124806933, 0.011221522124806933, 0.011221522124806933, 0.011221522124806933, 0.011221522124806933, 0.011221522124806933, 0.011221522124806933, 0.011221522124806933, 0.011221522124806933, 0.011221522124806933, 0.011221522124806933, 0.011221522124806933, 0.011221522124806933, 0.011221522124806933, 0.011221522124806933, 0.011221522124806933, 0.011221522124806933, 0.011221522124806933, 0.011221522124806933, 0.011221522124806933, 0.01220266897077021, 0.01220266897077021, 0.01220266897077021, 0.01220266897077021, 0.01220266897077021, 0.01220266897077021, 0.01220266897077021, 0.01250688334305161, 0.01250688334305161, 0.01250688334305161, 0.01250688334305161, 0.01250688334305161, 0.01250688334305161, 0.01250688334305161, 0.01250688334305161, 0.01250688334305161, 0.01250688334305161, 0.01250688334305161, 0.01250688334305161, 0.01250688334305161, 0.01250688334305161, 0.01250688334305161, 0.01250688334305161, 0.01250688334305161, 0.01250688334305161, 0.01250688334305161, 0.01250688334305161, 0.01250688334305161, 0.01250688334305161, 0.01250688334305161, 0.01250688334305161, 0.01250688334305161, 0.01250688334305161, 
0.01250688334305161, 0.014505355414192591, 0.014505355414192591, 0.014505355414192591, 0.014505355414192591, 0.014505355414192591, 0.014505355414192591, 0.014505355414192591, 0.014505355414192591, 0.01524607053576118, 0.01524607053576118, 0.01524607053576118, 0.01524607053576118, 0.01524607053576118, 0.01524607053576118, 0.01524607053576118, 0.01524607053576118, 0.01524607053576118, 0.01524607053576118, 0.01524607053576118, 0.01524607053576118, 0.01524607053576118, 0.01524607053576118, 0.01524607053576118, 0.015404915329058346, 0.015404915329058346, 0.015404915329058346, 0.015404915329058346, 0.015404915329058346, 0.015404915329058346, 0.015404915329058346, 0.015404915329058346, 0.015404915329058346, 0.015404915329058346, 0.015404915329058346, 0.015404915329058346, 0.015404915329058346, 0.015837213855906463, 0.017643610569380134, 0.017643610569380134, 0.017643610569380134, 0.017643610569380134, 0.017643610569380134, 0.017643610569380134, 0.017643610569380134, 0.017643610569380134, 0.017643610569380134, 0.017643610569380134, 0.017643610569380134, 0.017643610569380134, 0.017643610569380134, 0.017643610569380134, 0.017643610569380134, 0.017643610569380134, 0.017643610569380134, 0.017643610569380134, 0.017643610569380134, 0.017643610569380134, 0.017643610569380134, 0.017643610569380134, 0.017643610569380134, 0.017643610569380134, 0.018660404569681696, 0.018660404569681696, 0.020872950531874518, 0.020872950531874518, 0.020872950531874518, 0.020872950531874518, 0.020872950531874518, 0.020872950531874518, 0.020872950531874518, 0.020872950531874518, 0.020872950531874518, 0.020872950531874518, 0.020872950531874518, 0.020872950531874518, 0.020872950531874518, 0.020872950531874518, 0.020872950531874518, 0.020872950531874518, 0.020872950531874518, 0.020872950531874518, 0.020872950531874518, 0.020872950531874518, 0.020872950531874518, 0.020872950531874518, 0.020872950531874518, 0.020872950531874518, 0.020872950531874518, 0.020872950531874518, 0.020872950531874518, 
0.020872950531874518, 0.020872950531874518, 0.020872950531874518, 0.020872950531874518, 0.020872950531874518, 0.020872950531874518, 0.020872950531874518, 0.020872950531874518, 0.020872950531874518, 0.02288815659018242, 0.02288815659018242, 0.02288815659018242, 0.02288815659018242, 0.02288815659018242, 0.02288815659018242, 0.02288815659018242, 0.02288815659018242, 0.02318941202441517, 0.02318941202441517, 0.02318941202441517, 0.02318941202441517, 0.02318941202441517, 0.025601966894170384, 0.02745793240099951, 0.02745793240099951, 0.030469418869701087, 0.030469418869701087, 0.030469418869701087, 0.030469418869701087, 0.031649468600811484, 0.031649468600811484, 0.031649468600811484, 0.031649468600811484, 0.031649468600811484, 0.031649468600811484, 0.031649468600811484, 0.031649468600811484, 0.031649468600811484, 0.031649468600811484, 0.031649468600811484, 0.031649468600811484, 0.031649468600811484, 0.031649468600811484, 0.031649468600811484, 0.031649468600811484, 0.031649468600811484, 0.031649468600811484, 0.031649468600811484, 0.031649468600811484, 0.03318429749581211, 0.034209255258829244, 0.046861045474881166, 0.048298592091885494, 0.048298592091885494, 0.048298592091885494, 0.048298592091885494, 0.048298592091885494, 0.048298592091885494, 0.048298592091885494, 0.048298592091885494, 0.048298592091885494, 0.048298592091885494, 0.048298592091885494, 0.048298592091885494, 0.04891569108043106, 0.04891569108043106, 0.04891569108043106, 0.04891569108043106, 0.04891569108043106, 0.04891569108043106, 0.04891569108043106, 0.04891569108043106, 0.04891569108043106, 0.04891569108043106, 0.04891569108043106, 0.04891569108043106, 0.04891569108043106, 0.04891569108043106, 0.04981588650755867, 0.049897804257083, 0.049897804257083, 0.049897804257083, 0.049897804257083, 0.049897804257083, 0.049897804257083, 0.049897804257083, 0.049897804257083, 0.049897804257083, 0.05081690247708189, 0.05467242947627652, 0.05467242947627652, 0.05467242947627652, 0.05467242947627652, 
0.05467242947627652, 0.05467242947627652, 0.05481710086367935, 0.05481710086367935, 0.05481710086367935, 0.05485546341750821, 0.05485546341750821, 0.05485546341750821, 0.055089155307102226, 0.055089155307102226, 0.055089155307102226, 0.055089155307102226, 0.055089155307102226, 0.055089155307102226, 0.05509817970492246, 0.05509817970492246, 0.05509817970492246, 0.05509817970492246, 0.05509817970492246, 0.05509817970492246, 0.05509817970492246, 0.05509817970492246, 0.05677377998146615, 0.05677377998146615, 0.05677377998146615, 0.05677377998146615, 0.05677377998146615, 0.05677377998146615, 0.05677377998146615, 0.05677377998146615, 0.05703536944842313, 0.05703536944842313, 0.06442952433704384, 0.06442952433704384, 0.06442952433704384, 0.06442952433704384, 0.06442952433704384, 0.06442952433704384, 0.06442952433704384, 0.06442952433704384, 0.06442952433704384, 0.06442952433704384, 0.06442952433704384, 0.06442952433704384, 0.06442952433704384, 0.06442952433704384, 0.06442952433704384, 0.06442952433704384, 0.06649111613211628, 0.0688743619567419, 0.0688743619567419, 0.0688743619567419, 0.07067059510945926, 0.07067059510945926, 0.07067059510945926, 0.07067059510945926, 0.07067059510945926, 0.07067059510945926, 0.07786785923711323, 0.07786785923711323, 0.07786785923711323, 0.07786785923711323, 0.07786785923711323, 0.07786785923711323, 0.07786785923711323, 0.07917009879824155, 0.08693985693561855, 0.08693985693561855, 0.08696214430364411, 0.08749780016064151, 0.08749780016064151, 0.08749780016064151, 0.08749780016064151, 0.08749780016064151, 0.08749780016064151, 0.08749780016064151, 0.08749780016064151, 0.08749780016064151, 0.08749780016064151, 0.08749780016064151, 0.08749780016064151, 0.08749780016064151, 0.08749780016064151, 0.09127803768251525, 0.09127803768251525, 0.09638672921682619, 0.09769320936724614, 0.09769320936724614, 0.09769320936724614, 0.10557328883038568, 0.10774376329496757, 0.10774376329496757, 0.10774376329496757, 0.1084390408167554, 0.11822729528682707, 
0.11822729528682707, 0.11822729528682707, 0.13535933255150795, 0.1383609827438404, 0.1383609827438404, 0.1383609827438404, 0.1383609827438404, 0.15114253416458057, 0.15853753519355132, 0.15853753519355132, 0.15853753519355132, 0.15853753519355132, 0.15853753519355132, 0.17588077798687632, 0.18864310137300858, 0.18864310137300858, 0.18864310137300858, 0.4043949148776759, 0.4685147721706657, 0.49579460910357864, 0.7469712201476048, 43.95161038164527]

Code to reproduce an error that is different, but similar enough that resolving it would likely also resolve my initial issue:

import json

with open("github_gist.json") as f:
    test = json.load(f)
test = sorted(test)
print(len([item for item in test if item < test[250]]))  # 8
Slater Victoroff
  • You really need to create a [mcve]. Show us some sample float values that demonstrate the problem. You ought to be able to reproduce it with just a handful of values. – John Kugelman Jan 16 '18 at 00:17
  • @JohnKugelman I know, I've tried, but I've been unable to reproduce it with my initial attempts. I'm not looking for a full solution, I'm curious as to how literally any data could yield the results I'm seeing. Since I can't see any way for this behavior to arise, recreating the behavior, or diagnosing it is my primary bottleneck. – Slater Victoroff Jan 16 '18 at 00:20
  • @JohnKugelman If you have any suggestions as to methods of reproducing this, or even just investigating this I would be completely willing to do so. – Slater Victoroff Jan 16 '18 at 00:22
  • Would you see the same behavior after converting the list to a set and then back to a list? (Removing the duplicates.) – DYZ Jan 16 '18 at 00:22
  • Do you want to keep the duplicates? – srikavineehari Jan 16 '18 at 00:23
  • @srig so long as I get an accurate cutoff number maintaining the duplicates isn't important. – Slater Victoroff Jan 16 '18 at 00:25
  • @DYZ checking now. – Slater Victoroff Jan 16 '18 at 00:25
  • Have you considered `sorted_values = sorted(set(all_values))`? – srikavineehari Jan 16 '18 at 00:26
  • If `all_values` contains lots of duplicates then it's certainly possible for `top_250` to contain fewer than 250 items, but I can't think of how it would be possible for it to contain more than 250 items. Assuming you really are using floats, and not some class with a weird `__lt__` method. – PM 2Ring Jan 16 '18 at 00:28
  • @DYZ that seems to be the source of it. I don't hit this issue if I dedup first. Unfortunately it doesn't actually solve my problem. I still need the cutoff for which only 250 values are below that number (with duplicates included), but that seems to be the source of the error. Is there some implicit undocumented de-duping that happens within the sorted function that I'm not aware of? – Slater Victoroff Jan 16 '18 at 00:28
  • @PM2Ring Same, that's why I asked the question :P This behavior falls away with de-duplication, but that seems to be directly contradictory to the python docs – Slater Victoroff Jan 16 '18 at 00:29
  • @srig unfortunately getting the 250th unique value isn't what I need. I need the 250th value regardless of uniqueness. The duplicates aren't important in the list, but they are meaningful when it comes to using the output. – Slater Victoroff Jan 16 '18 at 00:30
  • "My best guess is that the sorted comparator operates differently from a traditional < in Python." No, it doesn't. `sorted` creates a list, fills it with the data, and then calls the `.sort` method of that list. The `.sort` method calls the `.__lt__` method of the objects it's sorting, just like using the `<` operator does. Python's TimSort is _extremely_ well-tested (and has been adopted by a couple of other languages), so I'm sure a bug like this would have been detected before now. – PM 2Ring Jan 16 '18 at 00:40
  • I don't believe you. So... what do you get for `type(all_values), set(map(type, all_values)), min(all_values), max(all_values), sorted_values[250]`? – Stefan Pochmann Jan 16 '18 at 00:42
  • I've tried to reproduce this problem using a list of random floats in the range 0..1, with lots of duplicates, and I get the expected results. I _can_ reproduce your problem by using weird objects that use this random `__lt__` method: `def __lt__(self, other): return random.random() < 0.5` – PM 2Ring Jan 16 '18 at 00:42
  • Please [edit] your question and provide a MCVE (see [How to create a Minimal, Complete, and Verifiable Example](https://stackoverflow.com/help/mcve)) that reproduces the problem using as _small_ a set of data as possible (which you should also provide). – martineau Jan 16 '18 at 00:43
  • Yeah, I too want to see the output of `set(map(type, all_values))` – juanpa.arrivillaga Jan 16 '18 at 00:43
  • I agree with @JohnKugelman: sample float values that demonstrate the problem would be helpful. – srikavineehari Jan 16 '18 at 00:45
  • @PM2Ring @juanpa.arrivillaga @StefanPochmann, `set(map(type, all_values))` only gives `set([<type 'float'>])`. I've anonymized the data; does anyone know where I should host a 33M file? – Slater Victoroff Jan 16 '18 at 00:48
  • We don't want a 33MB file. We want you to reduce the dataset to something small enough to paste into your question. – PM 2Ring Jan 16 '18 at 00:49
  • I do want that file :-) – Stefan Pochmann Jan 16 '18 at 00:49
  • @PM2Ring I'm well aware. I am not certain that it is possible to do so. I have been trying to do so for several hours to no avail. – Slater Victoroff Jan 16 '18 at 00:50
  • @StefanPochmann any idea where I could put it? – Slater Victoroff Jan 16 '18 at 00:50
  • Can you use `random.sample` (or maybe `random.choice` in a list comp) to create a sublist of the data which still exhibits the weird behaviour? – PM 2Ring Jan 16 '18 at 00:52
  • dropbox? google drive? github? Choose your poison. – DSM Jan 16 '18 at 00:52
  • Don't know, it's not something I do. What format is it, btw? – Stefan Pochmann Jan 16 '18 at 00:52
  • @StefanPochmann Just a JSON dump. Putting it in a github gist, just a moment. – Slater Victoroff Jan 16 '18 at 00:56
  • @PM2Ring using random.sample changes the behavior significantly. The smallest I can drop it while maintaining this behavior (sort of) consistently is about 1k entries – Slater Victoroff Jan 16 '18 at 00:57
  • Can you please give us a list of 10 floats and explain what output you expect? – Darkonaut Jan 16 '18 at 00:59
  • Ok, I was hoping you could get it smaller, and select, say, the top 25, rather than the top 250. But 1000 floats isn't too large to put into a question. – PM 2Ring Jan 16 '18 at 01:01
  • @PM2Ring data added, adding more code, etc... – Slater Victoroff Jan 16 '18 at 01:06
  • So it's the `nan` values. – Stefan Pochmann Jan 16 '18 at 01:07
  • @StefanPochmann is it? There are very few of them and it's not clear to me that we'd get this kind of behavior. – Slater Victoroff Jan 16 '18 at 01:08
  • @Darkonaut the sorting isn't the problem. It's that the sorting isn't working as expected. The issue cannot be reproduced with 10 values, and it's not as simple as that. – Slater Victoroff Jan 16 '18 at 01:09
  • Comparing with `nan` results in `False`. It's neither smaller nor larger than any number. It's not even equal to itself! That's gotta mess sorting up. – Stefan Pochmann Jan 16 '18 at 01:12
  • Why are you using `sorted` instead of numpy's sort? Maybe `sorted` is just not fine-grained enough? Numpy is built for such tasks, so I'm wondering why you even try with `sorted`. – Darkonaut Jan 16 '18 at 01:12
  • @Darkonaut it's not particularly important. I'm reading in from json, and only running this once so the conversion overhead isn't worth it. Though, again, not material to the question. – Slater Victoroff Jan 16 '18 at 01:13
  • (a) Iterate through `sorted_values` comparing each item to the following item. If they all satisfy `<=`, you can stop speculating about the sort being wrong. (b) Save `sorted_values[250]` in `x`. Print it. Is it what you expect? (c) Iterate through `top_250` comparing each item to `x`. Is each one less than `x`? – Eric Postpischil Jan 16 '18 at 01:14
  • @StefanPochmann Fascinating. So having a couple of `nan`s doesn't raise any error, but it means that when we grab the nth entry, although it will likely be very close to the right one, it will be off if a `nan` was used in a comparison somewhere. In other words, `sorted` silently fails and produces incorrect output if there is any `nan` in the dataset? – Slater Victoroff Jan 16 '18 at 01:16
  • @EricPostpischil Been through all that; none of those tests show the issue, as they all behave as expected. Stefan figured it out; see the answer below. – Slater Victoroff Jan 16 '18 at 01:18
  • As an example of why examples are better than descriptions, you wrote "The list is a list of all floats, all between 0 and 1", which is false. – DSM Jan 16 '18 at 01:19
  • @SlaterTyranus: You could not have been through all that. If a NaN were in `sorted_values`, any comparison with it would not have satisfied `<=`. So my (a) would have revealed the problem. (Although incorrect code `if earlier > later then report error` would have failed to reveal the problem. The test needs to be `if earlier <= later then fine else report problem`.) – Eric Postpischil Jan 16 '18 at 01:19
  • @EricPostpischil I totally misread your (a), my mistake. I should have checked the sort. – Slater Victoroff Jan 16 '18 at 01:22
  • @DSM Apologies. All floats, apparently not all between 0 and 1. I believed this was the case, and inspected quite a lot of data to attempt to verify it. I assumed an error would have been thrown at some point where I was asking it to do something nonsensical. Apparently I was wrong. – Slater Victoroff Jan 16 '18 at 01:23
  • `nan` does throw a spanner in the works. Eg, `print(sorted([0, 0.5, 0.75, float('nan'), 0.2, 1.0]))` prints `[0, 0.5, 0.75, nan, 0.2, 1.0]`. That's rather disturbing, IMHO. I didn't expect `nan` to interfere with the sorting of the other items. – PM 2Ring Jan 16 '18 at 01:24
  • @PM2Ring Super disturbing and extraordinarily hard to diagnose. Shame everyone just assumed I was making it up, downvoted me and then left. – Slater Victoroff Jan 16 '18 at 01:28
  • FWIW, we can reproduce your problem with `all_values = [0, 0.5, float('nan'), 0.2, 0.3, 0.4]; sorted_values = sorted(all_values); top = [item for item in sorted_values if item < sorted_values[1]]; print(top, len(top))` – PM 2Ring Jan 16 '18 at 01:35
  • @SlaterTyranus Well you also did get four upvotes. And you did misrepresent the situation, as the downvoters probably assumed. Also, it's not *that* hard to diagnose if you have the data, I told you it's the nan values 85 seconds after you posted the data :-P – Stefan Pochmann Jan 16 '18 at 01:36
  • @StefanPochmann yea, but you don't want to know what I had to go through to get that data. I mean, I represented the situation as I understood it. It's very hard to diagnose when programmatically examining the data, as all the obvious places to check turn out fine. I'll admit some wrong-doing here sure, but on the flip side, this is a super unintuitive error in python that I'd wager most people were totally not aware of. If I were aware that there was a magical float value that I could put in a list to make all sorts on it break you better believe I would have checked that first :) – Slater Victoroff Jan 16 '18 at 01:41
  • Ok granted I knew about nan's comparisons already and I was maybe lucky to see it right away because for security the first thing I did with your data was print `set(open(...).read())` and there I noticed the letters "n" and "a" :-) – Stefan Pochmann Jan 16 '18 at 01:42
  • Ah, right... I forgot about having to get the data first. I'm glad you were able to provide some after all, hadn't expected that after you said "proprietary and large". I really would've been happy with the 33 MB file as well, found that reasonable if you can't reproduce it with less. And maybe that actually helped, since I wrote my eye-opening security check because I wasn't going to read 33 MB :-) (and because I don't know how safe json.load is). – Stefan Pochmann Jan 16 '18 at 02:12
  • Possible duplicate of [Python: sort function breaks in the presence of nan](https://stackoverflow.com/questions/4240050/python-sort-function-breaks-in-the-presence-of-nan) – Mark Dickinson Jan 16 '18 at 21:22
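
The sortedness check suggested in the comments above can be sketched roughly as follows. This is only an illustration: the helper name `first_unordered_pair` is made up here, and the sample list is the small reproduction from the comments rather than the original data. The point is that the test has to assert the positive condition (`<=`), because the inverted test (`>`) is also False for `nan` and would miss it.

```python
import math


def first_unordered_pair(values):
    """Return the index i of the first adjacent pair that fails
    values[i] <= values[i + 1], or None if every pair passes.

    (Hypothetical helper, not part of the standard library.)
    """
    for i in range(len(values) - 1):
        # Assert the positive condition. A nan on either side makes `<=`
        # False, so the pair is reported. Testing `values[i] > values[i + 1]`
        # instead would also be False for nan and would hide the problem.
        if not (values[i] <= values[i + 1]):
            return i
    return None


sorted_values = sorted([0, 0.5, float('nan'), 0.2, 0.3, 0.4])
bad_index = first_unordered_pair(sorted_values)
nan_count = sum(1 for v in sorted_values if math.isnan(v))
print(bad_index, nan_count)  # 1 1
```

A non-None result on the output of `sorted` means some comparisons are not behaving like a normal ordering, and counting `math.isnan` hits is a quick way to confirm that `nan` is the cause.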

1 Answer

4

Your data contains nan ("not a number"). That messes up sorting, because any <, > or == comparison involving nan evaluates to False. Demo:

>>> sorted([0.3, float('nan'), 0.1])
[0.3, nan, 0.1]

To be clear about what I mean by the comparisons:

>>> nan = float('nan')
>>> nan < 3
False
>>> nan > 3
False
>>> nan == 3
False
>>> nan == nan
False                 # yeah
Stefan Pochmann
  • Thanks so much for spending the time to figure this out! I had no idea that this is how `nan`'s would be handled here. I assumed that they would be less than every number, or more than every number, or something similar. Never did I realize that they were both more than and less than every number. I also had no idea that python would just do this sort with no complaints, in effect silently erroring and rendering any further use of the data invalid. – Slater Victoroff Jan 16 '18 at 01:20
  • It's not "both more than and less than every number" but *neither* more nor less. I added a few examples to make it clear. Though I'm sure it would trip the sorting up just as much if it were more and less. – Stefan Pochmann Jan 16 '18 at 01:33
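
One possible way around the problem, sketched here under the assumption that the `nan` entries can simply be ignored when looking for a cutoff, is to filter them out with `math.isnan` before sorting. The helper name `sort_without_nan` is hypothetical, and the sample list is the small reproduction from the question's comments. The last lines show an alternative that keeps the `nan` values but pushes them to the end with a sort key.

```python
import math


def sort_without_nan(values):
    """Return (sorted non-nan values, number of nan values dropped).

    (Hypothetical helper; dropping nan is an assumption about what the
    caller wants, not something the question states.)
    """
    clean = [v for v in values if not math.isnan(v)]
    return sorted(clean), len(values) - len(clean)


all_values = [0, 0.5, float('nan'), 0.2, 0.3, 0.4]

sorted_values, dropped = sort_without_nan(all_values)
cutoff = sorted_values[1]
below = [v for v in sorted_values if v < cutoff]
print(sorted_values, dropped, below)  # [0, 0.2, 0.3, 0.4, 0.5] 1 [0]

# Alternative: keep nan but sort it last. The key compares the boolean
# first, so a nan is never compared against an ordinary number.
nan_last = sorted(all_values, key=lambda v: (math.isnan(v), v))
```

With duplicates in the data, the number of items strictly below `sorted_values[n]` can still be smaller than `n`, but it can no longer exceed it once the `nan` values are gone.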