8

Suppose I have a file with integers (<10^6). I need to make a sorted array using those integers. Consider the following cases:

  1. Case 1: Copy all the data into an array and sort (assume O(n log n)).
  2. Case 2: Sort while inserting each element into the array.

Which is safer and why? Which is faster and why? If the number of integers is further increased (>10^9), what are the implications then?

I tried both the cases and 'after sorting' yielded better results in terms of speed. I can see why, but is there a better way to do case 2? (Currently I check the entering element against each element in the array to find its appropriate position.)
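
Roughly what my two cases look like (a simplified sketch, not my exact code; class and method names are just for illustration):

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class SortedArrayDemo {

    // Case 1: append everything first, sort once at the end.
    static List<Integer> copyThenSort(List<Integer> input) {
        List<Integer> result = new ArrayList<>(input);   // O(n) copy
        Collections.sort(result);                        // O(n log n) sort
        return result;
    }

    // Case 2 (what I do now): linear scan for the position, then insert.
    static List<Integer> insertSorted(List<Integer> input) {
        List<Integer> result = new ArrayList<>();
        for (int value : input) {
            int pos = 0;
            while (pos < result.size() && result.get(pos) < value) {
                pos++;                                   // O(n) scan per element
            }
            result.add(pos, value);                      // O(n) shift per element
        }
        return result;
    }
}
```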

magpie
  • When you keep array sorted you should use binary search to find the insertion point. Also, how do you actually *insert* into array? – PM 77-1 Dec 22 '17 at 15:51
  • Finding the appropriate position and shifting all else up one position. – magpie Dec 22 '17 at 16:00
  • 1
    Are the numbers unique? If so, you could insert them into a `SortedSet` and then call `toArray` on that set. If there are repeats, you could use a `TreeMap`, but more complicated. – tobias_k Dec 22 '17 at 16:04
  • Another approach, for elements <= 10^7 where you don't have all n elements at once, is to use a Fenwick tree, where each update takes log(n). This method will be a lot faster than case 2 (see the sketch just below). – Atul Dec 22 '17 at 16:38
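
A rough sketch of that Fenwick (binary indexed) tree idea, assuming all values fit in [0, 10^6) as stated in the question (the class and method names are hypothetical): each incoming number costs O(log M) to record, and the k-th smallest can be queried at any time, so a sorted view is always available.

```java
// Fenwick / binary indexed tree over the value range: counts indexed by value.
public class FenwickCounter {
    private static final int MAX_VALUE = 1_000_000;     // question says integers < 10^6
    private final int[] tree = new int[MAX_VALUE + 1];  // 1-based Fenwick array

    // Record one occurrence of 'value' in O(log MAX_VALUE).
    public void add(int value) {
        for (int i = value + 1; i <= MAX_VALUE; i += i & -i) {
            tree[i]++;
        }
    }

    // Number of recorded values <= 'value', in O(log MAX_VALUE).
    public int countUpTo(int value) {
        int sum = 0;
        for (int i = value + 1; i > 0; i -= i & -i) {
            sum += tree[i];
        }
        return sum;
    }

    // k-th smallest recorded value (1-based k), via binary search on prefix counts.
    public int kth(int k) {
        int lo = 0, hi = MAX_VALUE - 1;
        while (lo < hi) {
            int mid = (lo + hi) >>> 1;
            if (countUpTo(mid) >= k) {
                hi = mid;
            } else {
                lo = mid + 1;
            }
        }
        return lo;
    }
}
```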

3 Answers

6

The problem with inserting each element into the already-sorted array (aka Insertion Sort) is: while finding the index where to insert can be done in O(log n) using binary search, actually inserting the element requires shifting all the following elements, resulting in (on average) O(n/2) per insertion into an array[] or ArrayList.
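
For illustration, a sorted insert into an ArrayList might look like the sketch below (names are illustrative): Collections.binarySearch finds the slot in O(log n), but the add(pos, value) call still shifts everything after it.

```java
import java.util.Collections;
import java.util.List;

public class SortedInsertList {
    // Sorted insert: O(log n) to locate the slot, O(n) to shift the tail.
    static void sortedInsert(List<Integer> sorted, int value) {
        int pos = Collections.binarySearch(sorted, value);
        if (pos < 0) {
            pos = -pos - 1;      // binarySearch returns -(insertionPoint) - 1 when absent
        }
        sorted.add(pos, value);  // still moves every later element one slot to the right
    }
}
```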

When using a LinkedList, the elements do not have to be shifted, but you can't binary-search either: finding the index where to insert is about O(n/4) (according to this overview), plus another O(n/4) for actually inserting the element, adding up to O(n/2), the same as for ArrayList. (You could create a custom skip list providing faster lookup and simultaneous insertion, but AFAIK Java does not provide one.)

If the numbers are unique, you could consider inserting them into a TreeSet and then calling toArray at the end. Inserting each number will be O(log n), for a total of O(n log n), with an additional O(n) each time you get the numbers in sorted order. (According to your comment, they are not unique, but maybe this helps others.) You could still use a variant of this approach using a TreeMap, mapping elements to their count, but this will be more complex to implement.
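
A sketch of that TreeMap variant (the method name is illustrative): map each value to its count, then expand the counts back into a sorted array; the keys of a TreeMap are iterated in ascending order.

```java
import java.util.Map;
import java.util.TreeMap;

public class SortViaTreeMap {
    // Handles duplicates by counting occurrences per value.
    static int[] sortViaTreeMap(int[] numbers) {
        TreeMap<Integer, Integer> counts = new TreeMap<>();
        for (int n : numbers) {
            counts.merge(n, 1, Integer::sum);                      // O(log n) per number
        }
        int[] sorted = new int[numbers.length];
        int i = 0;
        for (Map.Entry<Integer, Integer> e : counts.entrySet()) {  // ascending key order
            for (int c = 0; c < e.getValue(); c++) {
                sorted[i++] = e.getKey();
            }
        }
        return sorted;
    }
}
```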

Thus, collecting the numbers in an array[] or ArrayList or LinkedList first and then sorting at the end seems to be better -- provided, of course, that you do not need the sorted version of the array in each step.
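
A minimal sketch of that collect-then-sort approach, assuming the integers sit one per line in the file (the method name is illustrative); a primitive int[] also avoids boxing every value, which matters as n grows toward 10^6 and beyond.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;
import java.util.stream.Stream;

public class ReadThenSort {
    // Collect all values first, then sort once at the end.
    static int[] readAndSort(Path file) throws IOException {
        try (Stream<String> lines = Files.lines(file)) {                 // one integer per line (assumed)
            int[] values = lines.mapToInt(Integer::parseInt).toArray();  // O(n) to collect
            Arrays.sort(values);                                         // O(n log n), no boxing
            return values;
        }
    }
}
```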

The "sorted insert" would give you (on average) O(log n + n/2) = O(n/2) for inserting each of the n numbers, for a total of O(n²/2), while maintaining a sorted array at all times. Sorting at the end is O(1) for appending each of the n numbers plus O(n log n) whenever you need the sorted list (at the end or in between), resulting in O(n + k n log n) = O(k n log n) for sorting k > 0 times. (And if we solve for k, we see that sorting at the end would be faster as long as k < n / (2 log n).)

tobias_k
  • A linked list sort could be implemented using [bottom up merge sort](https://en.wikipedia.org/wiki/Merge_sort#Bottom-up_implementation_using_lists) which merges nodes into a small array of pointers to lists. After all elements are received, the array is merged to form a single sorted list. However, a linked list sort is slower than array sort, so it's probably not practical to use a linked list if using an array is an option. – rcgldr Dec 23 '17 at 16:04
1

Instead of checking the new element against every element in the array (linear search) you should use binary search: check in which half of the array it belongs, then cut that half in half and so on. This will yield O(n log n), same as the sort-after approach.
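
A sketch of that binary search on a plain int[] (the count parameter and method name are assumptions, since the answer doesn't show code): Arrays.binarySearch locates the slot in O(log n); note that the System.arraycopy shift afterwards is still linear, as the comment below points out.

```java
import java.util.Arrays;

public class SortedInsertArray {
    // Insert 'value' into the first 'count' (already sorted) slots of 'a';
    // 'a' must have room for at least one more element.
    static void sortedInsert(int[] a, int count, int value) {
        int pos = Arrays.binarySearch(a, 0, count, value);
        if (pos < 0) {
            pos = -pos - 1;                                  // -(insertionPoint) - 1 when absent
        }
        System.arraycopy(a, pos, a, pos + 1, count - pos);   // shift the tail up by one slot
        a[pos] = value;
    }
}
```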

O.O.Balance
  • While this is true, the insertion itself will be linear from the insertion point to the end of the array. Given that the implementation of the array is contiguous in memory, which I think most JVM arrays are? – MaLarsson Dec 22 '17 at 15:54
  • 1
    You are right of course. I was thinking about the question in mathematical terms of asymptotic runtime, not in actual programming terms. This means Case 1 will always be faster regardless of the asymptotical runtime - repeatedly inserting into an array is simply a bad idea when you can just build the array from start to finish. – O.O.Balance Dec 22 '17 at 15:58
1

Case 1: inserting n elements into an array takes n time, since appending to an array is constant time. Then sorting after takes n log n time, thus n log n in total.

Case 2: inserting an element into a sorted array at the 'correct' position takes log n time (with binary search). Doing this for n elements yields a running time of n log n.

Both take the same amount of time, but case 2 will use n memory while, depending on your sorting algorithm, case 1 can require more memory.

Oscar de Leeuw
  • 2
    Well, _finding_ the position takes O(log n), but actually _inserting_ the element (and shifting all the following elements "one up") might take O(n) depending on the type of list. – tobias_k Dec 22 '17 at 15:55
  • In case of primitive arrays this will be useless but best used for LinkedList. Right? – magpie Dec 22 '17 at 16:04
  • 1
    @tobias_k The worst case complexity of case 2 is n ^ 2 ?? – Atul Dec 22 '17 at 16:44
  • 1
    @Atul Sounds about right, although in practice it's more like O(n²/2), as the list starts empty and you don't always have to shift all the elements. – tobias_k Dec 22 '17 at 16:54