9

I am producing flat lists with 10^6 to 10^7 Real numbers, and some of them are repeating.

I need to delete the repeating instances, keeping the first occurrence only, and without modifying the list order.

The key here is efficiency, as I have a lot of lists to process.

Example (fake):

Input:

  {.8, .3 , .8, .5, .3, .6}

Desired Output

  {.8, .3, .5, .6}  

Aside note

Deleting repeating elements with Union (without preserving order) gives in my poor man's laptop:

DiscretePlot[a = RandomReal[10, i]; First@Timing@Union@a, {i, 10^6 Range@10}]

enter image description here

Dr. belisarius
  • 60,527
  • 15
  • 115
  • 190

3 Answers3

9

You want DeleteDuplicates, which preserves list order:

In[13]:= DeleteDuplicates[{.8, .3, .8, .5, .3, .6}]

Out[13]= {0.8, 0.3, 0.5, 0.6}

It was added in Mathematica 7.0.

Michael Pilat
  • 6,480
  • 27
  • 30
  • 4
    @Michael: I'd rather use `DeleteDuplicates[...,Equal]`, given that the numbers in question are real, and default comparison is `SameQ`. Much slower, but more robust. – Leonid Shifrin Mar 09 '11 at 13:59
  • hmm, I always forget about the built-ins like `DeleteDuplicates`. For what its worth, as a built-in in runs 3 - 6 times faster across ~10^6 elements than `unsortedUnion` in my answer. (Note, this is using `SameQ` not `Equal`.) – rcollyer Mar 09 '11 at 15:06
  • @Leonid would it work to `Round` the numbers first, to something just less than `$MachinePrecision` and use plain `DeleteDuplicates` since `DeleteDuplicates[...,Equal]` is so very slow, and belisarius wants speed? – Mr.Wizard Mar 09 '11 at 15:28
  • +1 I love Mma having all those constructs. @Leonid, in my particular case SameQ works OK (because of how the numbers are generated). Also, with SameQ this is twice as fast as Union[ ]!, but Union[ _ , SameTest->SameQ] takes much longer. Any ideas why is that so? – Dr. belisarius Mar 09 '11 at 15:40
  • 1
    @Mr.Wizard: This might work, but not sure if what you propose is robust enough, I wouldn't rely on `SameQ` in general for numerics. But, as @belisarius notes, for his purposes `SameQ` seems fine. @belisarius Ok, then `DeleteDuplicates` is your friend indeed. Another (robust but slower) alternative is `Tally[list, Equal][[All, 1]]`. Regarding `Union` being slow with explicit test, this is mainly because it switches to quadratic-time algorithm, see this thread for instance: http://groups.google.com/group/comp.soft-sys.math.mathematica/browse_thread/thread/9e1086e0fac30bb3 – Leonid Shifrin Mar 09 '11 at 15:59
  • 1
    @belisarius, just checked timing on Union and it runs about twice as fast as this code. So, much for the utility of my solution. Removed it, as it cannot compete. – rcollyer Mar 09 '11 at 16:19
  • @Leonid do you know if a case where `Round` will give outputs that do not match under `SameQ` where one might expect them to? – Mr.Wizard Mar 09 '11 at 16:21
  • @Mr.Wizard There are cases when `Round` will affect the results of `SameQ`, the simplest probably being `0 === 0.` vs `Round[0.] === Round[0]`. And generally, my impression is that rounding does not guarantee SameQ for machine precision numbers. But I'd be also concerned with the rounding procedure itself, since I don't see a general well-defined way to round machine-precision numbers for the purposes of `SameQ`, if you want to stay within the machine-precision arithmetic (I may be wrong). You can go to fixed higher precision, but this would likely be much slower and destroy the purpose. – Leonid Shifrin Mar 09 '11 at 17:36
  • @Leonid Thank you. Is there a section of your book that deals with this? – Mr.Wizard Mar 09 '11 at 17:41
  • @belisarius Upon a second thought, I would not rely on `SameQ` for machine-precision numbers regardless of the way they are obtained. I'd use the version with `Tally` that I mentioned - it is about 4 times slower but it will not lead to some nasty surprises. Just my 2 cents :) – Leonid Shifrin Mar 09 '11 at 18:59
  • @Mr. Wizard Unfortunately, no. My current book is very basic. I am working (slowly :)) on the second edition which hopefully will cover much more. – Leonid Shifrin Mar 09 '11 at 19:01
  • @Leonid I'll test intensively before assuming the SameQ test is OK. The calculation details are too boring to post here, but the tests I did so far are encouraging. Thanks for the warning. – Dr. belisarius Mar 09 '11 at 20:33
  • With mma 7 (Macintosh) `Sort@DeleteDuplicates@list` seems much faster that `Union@list`. I am wondering if this is a general phenomenon or just a quirk of the system I am using? – 681234 Mar 10 '11 at 01:03
9

Not to compete with other answers, but I just could not help sharing a Compile - based solution. The solution is based on building a binary search tree, and then checking for every number in the list, whether its index in the list is the one used in building the b-tree. If yes, it is the original number, if no - it is a duplicate. What makes this solution interesting for me is that it shows a way to emulate "pass-by-reference" with Compile. The point is that, if we inline compiled functions into other Compiled functions (and that can be achieved with an "InlineCompiledFunctions" option), we can refer in inner functions to the variables defined in outer function scope (because of the way inlining works). This is not a true pass-by-reference, but it still allows to combine functions from smaller blocks, without efficiency penalty (this is more in the spirit of macro-expnsion). I don't think this is documented, and have no idea whether this will stay in future versions. Anyways, here is the code:

(* A function to build a binary tree *)
Block[{leftchildren , rightchildren},
makeBSearchTree = 
Compile[{{lst, _Real, 1}},
Module[{len = Length[lst], ctr = 1, currentRoot = 1},
 leftchildren = rightchildren =  Table[0, {Length[lst]}];
 For[ctr = 1, ctr <= len, ctr++,
  For[currentRoot = 1, lst[[ctr]] != lst[[currentRoot]],(* 
   nothing *),
   If[lst[[ctr]] < lst[[currentRoot]],
    If[leftchildren[[currentRoot]] == 0,
     leftchildren[[currentRoot]] = ctr;
     Break[],
     (* else *)
     currentRoot = leftchildren[[currentRoot]] ],
    (* else *)
    If[rightchildren[[currentRoot]] == 0,
     rightchildren[[currentRoot]] = ctr;
     Break[],
     (* else *)
     currentRoot = rightchildren[[currentRoot]]]]]];
 ], {{leftchildren, _Integer, 1}, {rightchildren, _Integer, 1}},
CompilationTarget -> "C", "RuntimeOptions" -> "Speed",
CompilationOptions -> {"ExpressionOptimization" -> True}]];


(* A function to query the binary tree and check for a duplicate *)
Block[{leftchildren , rightchildren, lst},
isDuplicate = 
Compile[{{index, _Integer}},
Module[{currentRoot = 1, result = True},
 While[True,
  Which[
   lst[[index]] == lst[[currentRoot]],
    result = index != currentRoot;
    Break[],
   lst[[index]] < lst[[currentRoot]],
    currentRoot = leftchildren[[currentRoot]],
   True,
    currentRoot = rightchildren[[currentRoot]]
   ]];
 result
 ],
{{leftchildren, _Integer, 1}, {rightchildren, _Integer, 
  1}, {lst, _Real, 1}},
CompilationTarget -> "C", "RuntimeOptions" -> "Speed",
CompilationOptions -> {"ExpressionOptimization" -> True}
]];


(* The main function *)
Clear[deldup];
deldup = 
Compile[{{lst, _Real, 1}},
  Module[{len = Length[lst], leftchildren , rightchildren , 
     nodup = Table[0., {Length[lst]}], ndctr = 0, ctr = 1},
makeBSearchTree[lst]; 
For[ctr = 1, ctr <= len, ctr++,
 If[! isDuplicate [ctr],
  ++ndctr;
   nodup[[ndctr]] =  lst[[ctr]]
  ]];
Take[nodup, ndctr]], CompilationTarget -> "C", 
"RuntimeOptions" -> "Speed",
CompilationOptions -> {"ExpressionOptimization" -> True,
 "InlineCompiledFunctions" -> True, 
 "InlineExternalDefinitions" -> True}];

Here are some tests:

In[61]:= intTst = N@RandomInteger[{0,500000},1000000];

In[62]:= (res1 = deldup[intTst ])//Short//Timing
Out[62]= {1.141,{260172.,421188.,487754.,259397.,<<432546>>,154340.,295707.,197588.,119996.}}

In[63]:= (res2 = Tally[intTst,Equal][[All,1]])//Short//Timing
Out[63]= {0.64,{260172.,421188.,487754.,259397.,<<432546>>,154340.,295707.,197588.,119996.}}

In[64]:= res1==res2
Out[64]= True

Not as fast as the Tally version, but also Equal - based, and as I said, my point was to illustrate an interesting (IMO) technique.

Leonid Shifrin
  • 22,449
  • 4
  • 68
  • 100
  • I don't understand half of what you wrote, so I'll have to trust you know what you're doing, but it sounds cool, so +1. Some day I am going to have to learn a lower level language, and then maybe I'll understand. – Mr.Wizard Mar 09 '11 at 20:31
  • @Leonid +1 Thanks for sharing these ideas. I'm sure I'll learn a quite a few things – Dr. belisarius Mar 10 '11 at 00:03
  • 1
    @Leonid This is an interesting technique and somewhat unconventional use of `Compile`. The performance if your compiled code is hampered by calls back to Mathematica from compiled code. You can see them by loading `Needs["CompiledFunctionTools"]` (n.b.: use backtick), and evaluating `CompilePrint[makeBSearchTree]`. You will see occurrences of `MainEvaluate` meaning a call back to Mathematica. – Sasha Apr 22 '11 at 15:38
  • 1
    @Sasha: At first, I got confused by your observation. But then, I looked at the final code of `deldup` (through `CompilePrint`), and found exactly what I was hoping for (i.e. what I expected intuitively from my limited experiences with this technique): the compiler is smart enough to remove the calls to `MainEvaluate` from the functions being inlined into another Compiled function! So, If there is inefficiency here, it is not due to the calls to mma evaluator, I think. The status of auxiliary functions is that of building blocks, they are not supposed to be used on their own. – Leonid Shifrin Apr 22 '11 at 17:28
  • 1
    @Leonid Yes, you are right! In that case one can drop `CompilationTarget->"C"` in those auxiliary functions. `Compile` will be using only their Mathematica representation when building `deldup`. From the look at compiled print the code is as efficient as it gets, so I think `Tally` simply uses more efficient algorithm, coupled with better optimized code, I guess. – Sasha Apr 22 '11 at 17:39
  • @Sasha Yes, you are right about dropping the `CompilationTarget->"C"` part, I just checked and it works (and indeed, it is logical that inlining must of course happen on the mma, not C, level). And this is a good thing, since compilation to C takes time, which is wasted here for auxiliary functions. – Leonid Shifrin Apr 22 '11 at 17:51
5

For versions of Mathematica before 7, and for general interest, here are several ways of implementing the UnsortedUnion (i.e. DeleteDuplicates) function. These are collected from the help docs and MathGroup. They have been adjusted to accept multiple lists which are then joined, in analogy to Union.

For Mathematica 4 or earlier

UnsortedUnion = Module[{f}, f[y_] := (f[y] = Sequence[]; y); f /@ Join@##] &

For Mathematica 5

UnsortedUnion[x__List] := Reap[Sow[1, Join@x], _, # &][[2]]

For Mathematica 6

UnsortedUnion[x__List] := Tally[Join@x][[All, 1]]

From Leonid Shifrin for Mathematica 3+ (?)

unsortedUnion[x_List] := Extract[x, Sort[Union[x] /. Dispatch[MapIndexed[Rule, x]]]]
Mr.Wizard
  • 24,179
  • 5
  • 44
  • 125
  • 1
    @Mr. And the is no "archeologist badge" here! +1 – Dr. belisarius Mar 09 '11 at 18:57
  • Although it probably dosen't matter too much, Ted Ersek points out that the UnsortedUnion Version 1 will fail with lists such as {3, 1, 1, f[3], 3, 2, f[3], 1, 1, 8, 2, 6, 8}. (He discusses the origin of this function, {due originally to Carl Woll?} [here](http://www.verbeia.com/mathematica/tips/Tricks.html), under 'Clever Little Programs' His Version: ClearAll[DeleteRepititions]; DeleteRepititions[x_]:= Module[{t}, t[n_]:= (t[n] = Sequence[]; n); t/@x]) – 681234 Mar 10 '11 at 13:54
  • @TomD Well caught. `Block` changed to `Module` – Mr.Wizard Mar 10 '11 at 14:13
  • 2
    @Mr.Wizard Here is another one to add to your collection :) : `unsortedUnion[x_List] := Extract[x, Sort[Union[x] /. Dispatch[MapIndexed[Rule, x]]]]`. Works since I think v.4.2 (when Dispatch was introduced). For large lists, about 2-3 times faster than the `Reap-Sow` version. I discuss it in my book here: http://www.mathprogramming-intro.org/book/node479.html – Leonid Shifrin Mar 10 '11 at 14:54
  • @Leonid Thanks. I have not read that section yet. I am sure I will learn a lot before I finish your book. Then I shall await version 2. – Mr.Wizard Mar 10 '11 at 15:20
  • @belisarius hey look, there is now: http://stackoverflow.com/badges/1286/archaeologist – Mr.Wizard Feb 17 '12 at 05:43