3

I want to merge two arrays of sorted values into one. Since both source arrays are stored as succeeding parts of a large array, I wonder if you know a way to merge them within that large storage - meaning an in-place merge.

All methods I found need some external storage. They often require temporary arrays of size sqrt(n). Is there an efficient way without it?

I'm using C#. Other languages are welcome too. Thanks in advance!
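To make the setup concrete (the method and parameter names here are just placeholders), the operation I am after looks roughly like this:

// a[0..mid) and a[mid..a.Length) are each already sorted.
// Afterwards the whole of a should be sorted - ideally with only O(1) extra storage.
static void MergeInPlace(int[] a, int mid)
{
    // ... this is the part I am looking for ...
}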

Gilles 'SO- stop being evil'
  • What about [this](http://stackoverflow.com/questions/2331698/c-sharp-array-merge-without-dupes)? – Shadow The GPT Wizard Feb 20 '12 at 11:44
  • This question has been asked here several times already. Did you try a search first? – user492238 Feb 20 '12 at 11:59
  • 3
    seems a duplicate of: http://stackoverflow.com/q/4373307/650084 – armel Feb 20 '12 at 13:25
  • If the values are stored as succeeding parts of a larger array, you just want to sort the array, then remove consecutive values which are equal. – Ben Feb 20 '12 at 13:49
  • Thanks for the links! @Ben sorting a sorted array is too expensive. – Anastasia Rushda Feb 20 '12 at 17:12
  • I am curious about how you created 2 arrays which are "succeeding parts of a large array". AFAIK it is not possible in C# to have a variable that points to a subset of an existing array without copying the elements (and thus making a new array). – Chris Shain Feb 20 '12 at 20:19
  • @ChrisShain thanks for being precise. I mixed up the terms. The large one is the only one which corresponds to a concrete System.Array object. Both small 'arrays' (oops, did it again) are abstract arrays - only a consecutive collection of elements, stored in the large array. Please don't blame me for using the term 'abstract' here. It's not related to any keyword. :) – Anastasia Rushda Feb 20 '12 at 21:07
  • No trouble- that makes the problem space much more clear. So your "arrays" are actually bounded ranges within a single larger array? – Chris Shain Feb 20 '12 at 21:09
  • @ChrisShain Right. This is what makes an in-place algorithm so attractive. – Anastasia Rushda Feb 20 '12 at 21:12
  • OK, so now we get to the next question. What do you mean by "merge"? Do you mean "rearrange the elements such that the smallest (or largest) value in the combined list of values is placed at the beginning of the first range, followed by the second smallest, and so on, until the first range is filled with the smallest N elements, and then proceed to fill the second range with the N+1 .. M next smallest elements"? – Chris Shain Feb 20 '12 at 21:16
  • @ChrisShain I need the large array to come out sorted at the end. And I must take advantage of the fact that both smaller ranges are sorted in advance. – Anastasia Rushda Feb 20 '12 at 21:30
  • OK, I think I understand what you are asking now. One more- is this classwork or homework of some kind? If so it should be tagged homework. – Chris Shain Feb 20 '12 at 21:33
  • The answer at stackoverflow.com/q/4373307/650084 (from armel above) makes it clear that this is a solvable problem, but not at all a trivial one, being the subject of multiple recent academic publications. Go there for the references and code and upvote the response. – Chiara Coetzee Feb 20 '12 at 22:00
  • @DerrickCoertzee Unfortunately, the algorithm described there is by no means able to do it without extra storage – Anastasia Rushda Feb 20 '12 at 22:33
  • @AnastasiaRushda, we would need to know your criteria for "too expensive". Why is sort and de-dupe, which are trivial to implement, too expensive? Have you timed the built-in sort for example? What about an in-place quicksort which can be parallelised? – Ben Feb 21 '12 at 10:39
  • @Ben if I simply sorted the large array, it would mean O(n log n). A merge can be done in O(n)! I was hoping for a way to get O(n) but without the extra storage needed. – Anastasia Rushda Feb 21 '12 at 12:24
  • @AnastasiaRushda Answer expanded. But: Why is `O(n log n)` too expensive? ***Is it in fact too expensive?*** Have you tried it? (I am assuming you have a practical issue here with a concrete definition of "too expensive".) – Ben Feb 21 '12 at 12:43

3 Answers

4

AFAIK, merging two (even sorted) arrays does not work in place without considerably increasing the necessary number of comparisons and moves of elements. See: merge sort. However, blocked variants exist which are able to sort a list of length n by utilizing a temporary array of length sqrt(n) - as you wrote - while still keeping the number of operations reasonably low. It's not bad - but it's also not "nothing", and obviously the best you can get.

For practical situations, and if you can afford it, you are better off using a temporary array to merge your lists.
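For reference, the straightforward out-of-place merge of two adjacent sorted runs looks roughly like this (a sketch only; MergeRuns and its parameter names are made up for illustration, and temp must be at least hi - lo elements long):

// Merge the sorted runs a[lo..mid) and a[mid..hi) using a temporary buffer.
// O(n) comparisons and moves, but O(n) extra storage.
static void MergeRuns<T>(T[] a, int lo, int mid, int hi, T[] temp) where T : IComparable<T>
{
    int i = lo, j = mid, k = 0;
    while (i < mid && j < hi)
        temp[k++] = a[i].CompareTo(a[j]) <= 0 ? a[i++] : a[j++];
    while (i < mid) temp[k++] = a[i++];
    while (j < hi) temp[k++] = a[j++];
    // Copy the merged result back into the original storage.
    Array.Copy(temp, 0, a, lo, k);
}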

Haymo Kutschbach
2

If the values are stored as succeeding parts of a larger array, you just want to sort the array, then remove consecutive values which are equal.

static int SortAndDedupe<T>(T[] a) where T : IComparable<T>
{
    // Nothing to do for an empty array
    if (a.Length == 0) return 0;
    // Do an efficient in-place sort
    Array.Sort(a);
    // Now deduplicate
    int lwm = 0; // low water mark
    int hwm = 1; // high water mark
    while (hwm < a.Length)
    {
        // If the lwm and hwm elements are the same, it is a duplicate entry.
        if (a[lwm].CompareTo(a[hwm]) == 0)
        {
            hwm++;
        }
        else
        {
            // Not a duplicate entry - move the lwm up
            // and copy down the hwm element over the gap.
            lwm++;
            if (lwm < hwm)
            {
                a[lwm] = a[hwm];
            }
            hwm++;
        }
    }
    // New length is lwm + 1;
    // number of elements removed is (hwm - lwm - 1)
    return lwm + 1;
}

Before you conclude that this will be too slow, implement it and profile it. That should take about ten minutes.
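For example (the array size and the random fill below are made up purely for illustration), wrapping the call in a Stopwatch is enough to get a first number:

// Rough timing sketch - adjust the size to match your real data.
var rnd = new Random(42);
var data = new int[10000000];
for (int i = 0; i < data.Length; i++) data[i] = rnd.Next();

var sw = System.Diagnostics.Stopwatch.StartNew();
int newLength = SortAndDedupe(data);
sw.Stop();
Console.WriteLine("Took {0} ms, {1} unique elements", sw.ElapsedMilliseconds, newLength);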

Edit: This can of course be improved by using a different sort rather than the built-in sort, e.g. Quicksort, Heapsort or Smoothsort, depending on which gives better performance in practice. Note that hardware architecture issues mean that practical performance may well differ considerably from what a big O analysis suggests.

Really you need to profile it with different sort algorithms on your actual hardware/OS platform.

Note: I am not attempting to give an academic answer here; I am trying to give a practical one, on the assumption that you are trying to solve a real problem.

Ben
  • Thank you very much Ben. But before I get to my profiler, I need to point out that the sort does not take the sorted state of the source into account, hence it will cost the regular O(n log n). I was hoping for something in the range of your dedupe, regarding effort, which is O(n). – Anastasia Rushda Feb 21 '12 at 12:33
  • Thanks for the edit. I insist on the importance of big O, since we simply have huge data to handle. For large data the O notation does correspond to execution speed, as you know. For the same reason I need in-place: large external storage would cause the GC much more work. – Anastasia Rushda Feb 21 '12 at 13:14
  • Even for large data big O does not necessarily correspond to performance. It depends not only on the relative expenses of comparison versus exchange, but also hardware issues like memory locality including paging and processor cache usage. You can't do better than code up three or four versions and compare their performance. In particular if you are concerned with the storage requirements of merge sort for theoretical rather than empirical reasons --- just try it and then you will know if the problem is real! Please tell us what you find though - I am very interested in your results. – Ben Feb 21 '12 at 13:31
  • I appreciate your efforts, Ben. But this simply does not answer the question. I am actually seeking a third solution to implement and test. If it existed, it would most probably outperform the others. Negating this would also negate the effort spent on that research topic. But maybe it simply doesn't exist. I will choose between the rest then. – Anastasia Rushda Feb 21 '12 at 21:25
  • 1
    @AnastasiaRushda I do understand the big O affinity. But regarding your concerns about temporary storage, I suggest you simply reuse that storage? A pooled solution would not require the GC at all!? This keeps the implementation clean and maintainable. – user492238 Feb 27 '12 at 14:39
2

Don't worry about external storage. sqrt(n) or even larger should not harm your performance. You will just have to make sure the storage is pooled - especially for large data, and especially when merging in loops. Otherwise, the GC will get stressed and eat up a considerable part of your CPU time / memory bandwidth.
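For instance (the class and member names here are only illustrative), a buffer that is allocated once and grown on demand can be handed to every merge call, so no per-iteration allocations ever reach the GC:

// Sketch of a pooled merge buffer: allocate once, reuse for every merge.
class MergeBuffer<T>
{
    private T[] temp = new T[0];

    // Returns a buffer of at least 'size' elements, reusing the previous allocation when possible.
    public T[] Rent(int size)
    {
        if (temp.Length < size)
            temp = new T[size];
        return temp;
    }
}

Inside a loop you would call Rent once per merge and hand the same array to the merge routine each time.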

user492238