
I measured a sorting program/algorithm, and based on the runtime data I have narrowed it down to two sorting algorithms: bubble sort and insertion sort.

Is there a way to know for sure which one it is? Without knowing the code of course.

They both have the same time complexity and I'm out of ideas.

Time complexity data:

  • Sorted input: O(n), time taken for 1000 numbers = 0.0047 s
  • Random (unsorted) input: O(n^2), time taken for 1000 numbers = 0.021 s
  • Descending-order input: O(n^2), time taken for 1000 numbers = 0.04 s

Thanks in advance!

  • You are missing important information: without knowing how fast your machine is, these numbers are not very helpful because they are too "absolute". If I tell you I went from point A to point B in 15 minutes, and distance from A to B is 1.5km, can you guess whether I stopped at the bakery to buy some bread? No, because you don't have enough information. At the very least I would need to tell you whether I'm walking or driving or taking the bus, and at what speed. – Stef Nov 10 '20 at 16:05
  • 2
    It would be helpful to try with lists of different sizes (100, 1000, 10000, 50000 elements), and comparing the results with results of insertion sort and bubble sort on the same lists. – Stef Nov 10 '20 at 16:06
  • I don't know what kind of control you have over the runtime environment, but if this were a C function, maybe you could arrange to call it on an invalid array whose latter values were not mapped, then inspect the core dump to see if the mapped prefix of the array is totally sorted (insertion sort) or not (bubble sort). – David Eisenstat Nov 10 '20 at 16:20
  • A less aggressive version of that idea is to run the algorithm on an a huge array and plot cache misses over time. Bubble sort will shoot up more quickly than insertion sort. – David Eisenstat Nov 10 '20 at 16:36
  • So you know *nothing* about the code other than that it's *some* bubble sort variant or *some* insertion sort variant? – superb rain Nov 10 '20 at 18:20
  • Check [edit1] in my answer; I got an insane idea of obtaining the complexity from measured times programmatically, and it looks like it works ... – Spektre Nov 13 '20 at 12:55

2 Answers

  1. Your 1000 elements to sort are too few

    the measured times are too low to represent a valid measurement (the majority of the time might not be spent in the sort itself, but in initialization of the window, opening files, etc ...).

    you need times of at least 100 ms or more (1 s is ideal).

  2. if you have access to the data that is being sorted

    You can feed it data sets that are challenging for each type of sort (and from the times infer the algorithm used) ... For example, bubble sort is slowest on an array sorted in reverse order ... so pass data sorted ascending, sorted descending, and random, and compare the times. Let's call the times tasc, tdes, trnd; assuming an ascending sort, if bubble sort is involved it should hold that:

    tasc O(n) < trnd < tdes O(n^2)

    so:

    tasc*n == tdes ± margin_of_error

    so just test whether tdes/tasc is close to n ... with some margin for error ...

    So you just need to create sample data that is hard for one specific type of sort and not for the others ... and from the times detect whether that is the case, until you find the algorithm used (see the timing sketch right after this list).

    Here is some data (all times are in [ms]) that I measured on my own ascending bubble sort:

       n     tasc    tdesc    tasc*n
    1000  0.00321  2.96147  3.205750
    2000  0.00609 11.76799 12.181855
    4000  0.01186 45.58834 47.445111
    

    To make it more clear: if we have a runtime for complexity O(n)

    t(O(n)) = c*n
    

    then to convert it to a runtime with complexity O(n^2) (assuming the same time constant c):

    t(O(n^2)) = c*n*n = t(O(n)) * n
    

    This way you can compare times of different complexities; you just need to convert all measured times into a single common complexity ...

  3. if you can choose the sorted data size

    then, as was mentioned in the comments, you can infer the growth rate of the times with increasing n (e.g. by doubling it); from that you can estimate the complexity, and from that you can tell which algorithm was used.

    So let's assume the measured times from #2. For O(n) the time constant c should be the same for all n, so for tasc (O(n)):

       n     tasc    c=tasc/n
    1000  0.00321 0.000003210
    2000  0.00609 0.000003045 
    4000  0.01186 0.000002965 
    

    and for tdesc (O(n^2)):

       n     tdesc        tdesc/n^2
    1000   2.96147 0.00000296147000
    2000  11.76799 0.00000294199750
    4000  45.58834 0.00000284927125
    

    as you can see, the constant c is more or less the same for both tasc and tdesc, which means they comply with their estimated complexities O(n) and O(n^2)
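
Below is a minimal standalone timing sketch (standard C++, not VCL) of the idea from #2 and #3. The function mystery_sort is just a hypothetical placeholder for whatever black-box routine you are measuring (std::sort stands in for it here), and the printed tdes/tasc ratio is only illustrative:

#include <algorithm>
#include <chrono>
#include <cstdio>
#include <random>
#include <vector>
//---------------------------------------------------------------------------
void mystery_sort(std::vector<int> &v)      // placeholder for the black-box sort under test
    {
    std::sort(v.begin(),v.end());           // replace with the real call
    }
//---------------------------------------------------------------------------
double time_ms(std::vector<int> v)          // time one run in [ms] (takes a copy on purpose)
    {
    auto t0=std::chrono::steady_clock::now();
    mystery_sort(v);
    auto t1=std::chrono::steady_clock::now();
    return std::chrono::duration<double,std::milli>(t1-t0).count();
    }
//---------------------------------------------------------------------------
int main()
    {
    std::mt19937 rng(42);
    for (int n=1000;n<=8000;n*=2)           // doubling n to see the growth rate (#3)
        {
        std::vector<int> asc(n),des(n),rnd(n);
        for (int i=0;i<n;i++){ asc[i]=i; des[i]=n-i; rnd[i]=int(rng()%n); }
        double tasc=time_ms(asc),tdes=time_ms(des),trnd=time_ms(rnd);
        // if the O(n) vs O(n^2) behaviour from #2 holds, tdes/tasc should be close to n
        printf("n=%5i tasc=%9.5f tdes=%9.5f trnd=%9.5f tdes/tasc=%9.1f\n",
            n,tasc,tdes,trnd,tdes/tasc);
        }
    return 0;
    }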

However, without knowing what the tested app does it is hard to be sure, as the sorting might be preceded by other processing ... For example, the data might be scanned to detect its form (sorted, random, almost sorted, ...), which is doable in O(n), and based on the result (along with the data size) the app might choose which sorting algorithm to use ... So your measurements might be measuring different routines, invalidating the results ...
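
For illustration, such an O(n) pre-scan is trivial to write; a hypothetical form-detection routine (not the actual code of the measured app) could look like this in standard C++:

#include <algorithm>
#include <vector>
//---------------------------------------------------------------------------
enum data_form { FORM_SORTED, FORM_REVERSED, FORM_OTHER };
//---------------------------------------------------------------------------
data_form detect_form(const std::vector<int> &v)   // one linear scan per check -> O(n)
    {
    if (std::is_sorted(v.begin(),v.end()))   return FORM_SORTED;
    if (std::is_sorted(v.rbegin(),v.rend())) return FORM_REVERSED;
    return FORM_OTHER;
    }
//---------------------------------------------------------------------------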

[edit1] I had an insane idea of detecting the complexity automatically

Simply test whether the time constant c is more or less the same across all measured times versus their corresponding n ... Here is simple C++/VCL code:

//$$---- Form CPP ----
//---------------------------------------------------------------------------
#include <vcl.h>
#include <math.h>
#pragma hdrstop
#include "Unit1.h"
//---------------------------------------------------------------------------
#pragma package(smart_init)
#pragma resource "*.dfm"
TForm1 *Form1;
//---------------------------------------------------------------------------
double factorial[]= // n[-],t[ms]
    {
     11,0.008,
     12,0.012,
     13,0.013,
     14,0.014,
     15,0.016,
     16,0.014,
     17,0.015,
     18,0.017,
     19,0.019,
     20,0.016,
     21,0.017,
     22,0.019,
     23,0.021,
     24,0.023,
     25,0.025,
     26,0.027,
     27,0.029,
     28,0.032,
     29,0.034,
     30,0.037,
     31,0.039,
     32,0.034,
     33,0.037,
     34,0.039,
     35,0.041,
     36,0.039,
     37,0.041,
     38,0.044,
     39,0.046,
     40,0.041,
     41,0.044,
     42,0.046,
     43,0.049,
     44,0.048,
     45,0.050,
     46,0.054,
     47,0.056,
     48,0.056,
     49,0.060,
     50,0.063,
     51,0.066,
     52,0.065,
     53,0.069,
     54,0.072,
     55,0.076,
     56,0.077,
     57,0.162,
     58,0.095,
     59,0.093,
     60,0.089,
     61,0.093,
     62,0.098,
     63,0.096,
     64,0.090,
     65,0.100,
     66,0.104,
     67,0.111,
     68,0.100,
     69,0.121,
     70,0.109,
     71,0.119,
     72,0.104,
     73,0.124,
     74,0.113,
     75,0.118,
     76,0.118,
     77,0.123,
     78,0.129,
     79,0.133,
     80,0.121,
     81,0.119,
     82,0.131,
     83,0.150,
     84,0.141,
     85,0.148,
     86,0.154,
     87,0.163,
     88,0.211,
     89,0.151,
     90,0.157,
     91,0.166,
     92,0.161,
     93,0.169,
     94,0.173,
     95,0.188,
     96,0.181,
     97,0.187,
     98,0.194,
     99,0.201,
    100,0.185,
    101,0.191,
    102,0.202,
    103,0.207,
    104,0.242,
    105,0.210,
    106,0.215,
    107,0.221,
    108,0.217,
    109,0.226,
    110,0.232,
    111,0.240,
    112,0.213,
    113,0.231,
    114,0.240,
    115,0.252,
    116,0.248,
    117,0.598,
    118,0.259,
    119,0.261,
    120,0.254,
    121,0.263,
    122,0.270,
    123,0.281,
    124,0.290,
    125,0.322,
    126,0.303,
    127,0.313,
    128,0.307,
      0,0.000
    };
//---------------------------------------------------------------------------
double sort_asc[]=
    {
    1000,0.00321,
    2000,0.00609,
    4000,0.01186,
       0,0.000
    };
//---------------------------------------------------------------------------
double sort_desc[]=
    {
    1000, 2.96147,
    2000,11.76799,
    4000,45.58834,
       0,0.000
    };
//---------------------------------------------------------------------------
double sort_rand[]=
    {
    1000, 3.205750,
    2000,12.181855,
    4000,47.445111,
       0,0.000
    };
//---------------------------------------------------------------------------
double div(double a,double b){ return (fabs(b)>1e-10)?a/b:0.0; }    // safe division: returns 0 instead of dividing by ~0
//---------------------------------------------------------------------------
AnsiString get_complexity(double *dat)  // expect dat[] = { n0,t(n0), n1,t(n1), ... , 0,0 }
    {
    AnsiString O="O(?)";
    int i,e;
    double t,n,c,c0,c1,a,dc=1e+10;
    // testbeg loops over all (n,t) samples in dat[] (terminated by n=0);
    // testend(s) tracks the min/max of the constant c over the samples and,
    // if its relative spread is the smallest so far, remembers complexity s
    #define testbeg for (e=1,i=0;dat[i]>0.5;){ n=dat[i]; i++; t=dat[i]; i++;
    #define testend(s) if ((c<=0.0)||(n<2.0)) continue; if (e){ e=0; c0=c; c1=c; } if (c0>c) c0=c; if (c1<c) c1=c; } a=fabs(1.0-div(c0,c1)); if (dc>=a){ dc=a; O=s; }


    testbeg;            c=div(t,n);                 testend("O(n)");
    testbeg;            c=div(t,n*n);               testend("O(n^2)");
    testbeg;            c=div(t,n*n*n);             testend("O(n^3)");
    testbeg;            c=div(t,n*n*n*n);           testend("O(n^4)");
    testbeg; a=log(n);  c=div(t,a);                 testend("O(log(n))");
    testbeg; a=log(n);  c=div(t,a*a);               testend("O(log^2(n))");
    testbeg; a=log(n);  c=div(t,a*a*a);             testend("O(log^3(n))");
    testbeg; a=log(n);  c=div(t,a*a*a*a);           testend("O(log^4(n))");
    testbeg; a=log(n);  c=div(t,n*a);               testend("O(n.log(n))");
    testbeg; a=log(n);  c=div(t,n*n*a);             testend("O(n^2.log(n))");
    testbeg; a=log(n);  c=div(t,n*n*n*a);           testend("O(n^3.log(n))");
    testbeg; a=log(n);  c=div(t,n*n*n*n*a);         testend("O(n^4.log(n))");
    testbeg; a=log(n);  c=div(t,n*a*a);             testend("O(n.log^2(n))");
    testbeg; a=log(n);  c=div(t,n*n*a*a);           testend("O(n^2.log^2(n))");
    testbeg; a=log(n);  c=div(t,n*n*n*a*a);         testend("O(n^3.log^2(n))");
    testbeg; a=log(n);  c=div(t,n*n*n*n*a*a);       testend("O(n^4.log^2(n))");
    testbeg; a=log(n);  c=div(t,n*a*a*a);           testend("O(n.log^3(n))");
    testbeg; a=log(n);  c=div(t,n*n*a*a*a);         testend("O(n^2.log^3(n))");
    testbeg; a=log(n);  c=div(t,n*n*n*a*a*a);       testend("O(n^3.log^3(n))");
    testbeg; a=log(n);  c=div(t,n*n*n*n*a*a*a);     testend("O(n^4.log^3(n))");
    testbeg; a=log(n);  c=div(t,n*a*a*a*a);         testend("O(n.log^4(n))");
    testbeg; a=log(n);  c=div(t,n*n*a*a*a*a);       testend("O(n^2.log^4(n))");
    testbeg; a=log(n);  c=div(t,n*n*n*a*a*a*a);     testend("O(n^3.log^4(n))");
    testbeg; a=log(n);  c=div(t,n*n*n*n*a*a*a*a);   testend("O(n^4.log^4(n))");

    #undef testend
    #undef testbeg
    return O+AnsiString().sprintf(" error = %.6lf",dc);
    }
//---------------------------------------------------------------------------
__fastcall TForm1::TForm1(TComponent* Owner):TForm(Owner)
    {
    mm_log->Lines->Clear();
    mm_log->Lines->Add("factorial "+get_complexity(factorial));
    mm_log->Lines->Add("sort asc  "+get_complexity(sort_asc));
    mm_log->Lines->Add("sort desc "+get_complexity(sort_desc));
    mm_log->Lines->Add("sort rand "+get_complexity(sort_rand));
    }
//-------------------------------------------------------------------------

with the relevant time measurements of my fast exact bigint factorial (where I simply used only the larger times, above 8 ms) and also the sorting measurements from above, which outputs this:

factorial O(n.log^2(n)) error = 0.665782
sort asc  O(n) error = 0.076324
sort desc O(n^2) error = 0.037886
sort rand O(n^2) error = 0.075000

The code just tests a few supported complexities and outputs the one that has the lowest error (the variation of the time constant c between different n) ...

Just ignore the VCL stuff and convert the AnsiString into any string or output you want ...
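
If you do not want to port the AnsiString/VCL parts by hand, the same idea can be sketched in portable standard C++ like below. This is my own rough translation, testing only a handful of candidate complexities, so it is not a drop-in replacement for the code above:

#include <cmath>
#include <cstdio>
#include <string>
#include <utility>
#include <vector>
//---------------------------------------------------------------------------
struct candidate { const char *name; double (*f)(double n); };  // one tested complexity
//---------------------------------------------------------------------------
// expects samples { n, t(n) }; for each candidate computes c = t/f(n) and
// returns the candidate whose c varies the least over all samples
std::string get_complexity(const std::vector<std::pair<double,double> > &dat)
    {
    static const candidate cand[]=
        {
        { "O(n)"        , [](double n){ return n; } },
        { "O(n^2)"      , [](double n){ return n*n; } },
        { "O(n^3)"      , [](double n){ return n*n*n; } },
        { "O(log(n))"   , [](double n){ return std::log(n); } },
        { "O(n.log(n))" , [](double n){ return n*std::log(n); } },
        };
    std::string O="O(?)"; double dc=1e+10;
    for (const candidate &cc:cand)
        {
        double c0=0.0,c1=0.0; bool first=true;
        for (const auto &p:dat)
            {
            double f=cc.f(p.first); if (f<=0.0) continue;
            double c=p.second/f;                        // time constant for this sample
            if (first){ first=false; c0=c; c1=c; }
            if (c0>c) c0=c;                             // c0 = min c
            if (c1<c) c1=c;                             // c1 = max c
            }
        double a=(c1>0.0)?std::fabs(1.0-(c0/c1)):1e+10; // relative spread of c
        if (dc>=a){ dc=a; O=cc.name; }
        }
    char buf[64]; std::snprintf(buf,sizeof(buf)," error = %.6f",dc);
    return O+buf;
    }
//---------------------------------------------------------------------------
int main()
    {
    std::vector<std::pair<double,double> > sort_desc=
        { {1000.0,2.96147},{2000.0,11.76799},{4000.0,45.58834} };
    std::printf("sort desc %s\n",get_complexity(sort_desc).c_str());
    return 0;
    }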

[edit2] I added graphics output (a graph plot) for this to my app, which revealed noise in some of my measured data, for example the factorial:

[plot: noisy factorial measurements]

and after filtering it a bit:

[plot: denoised factorial measurements]

The outputted complexity fits much better, e.g. O(n^1.4) for this factorial

– Spektre

Well, I have proposed an improved version of insertion sort which reduces the complexity significantly. Please check it if you want; I will leave the links below. However, I was looking for something else and ended up in your conversation. I hope I can get some help, as I want to know what I should do: I don't have rights to this publication or the technique I proposed, and someone copied mine exactly, with some tweaks, and published it again.

ENHANCED INSERTION SORT https://www.ijcaonline.org/archives/volume64/number21/10761-5724

https://citeseerx.ist.psu.edu/document?repid=rep1&type=pdf&doi=697a42768e1d2bb2fff0432ee61c9810f965653b

  • This does not really answer the question. If you have a different question, you can ask it by clicking [Ask Question](https://stackoverflow.com/questions/ask). To get notified when this question gets new answers, you can [follow this question](https://meta.stackexchange.com/q/345661). Once you have enough [reputation](https://stackoverflow.com/help/whats-reputation), you can also [add a bounty](https://stackoverflow.com/help/privileges/set-bounties) to draw more attention to this question. - [From Review](/review/late-answers/33529486) – Quimby Jan 03 '23 at 23:31