0

So first up, I'm fully aware mutation is a bad idea, but I need to keep object creation down to a minimum as I have an incredibly huge amount of data to process (keeps GC hang time down and speeds up my code).

What I want is a scala collection that has a method like distinct or similar, or possibly a library or code snippet (but native scala preferred) such that the method is side effecting / mutating the collection rather than creating a new collection.

I've explored the usual suspects like ArrayBuffer, mutable.List, Array, MutableList, Vector and they all "create a new sequence" from the original rather than mutate the original in place. Am I trying to find something that does not exist? Will I just have to write my own??

I think this exists in C++ http://www.cplusplus.com/reference/algorithm/unique/

Also, mega mega bonus points if there is some kind of awesome tail recursive way of doing this so that any bookkeeping structures created are kept in a single stack frame that is thus deallocated from memory once the method exits. The reason this would be uber cool is then even if the method creates some instances of things in order to perform the removal of duplicates, those instance will not need to be garbage collected and therefore not contribute to massive GC hangs. It doesn't have to be recursion, as long as it's likely to cause the objects to go on the stack (see escape analysis here http://www.ibm.com/developerworks/java/library/j-jtp09275/index.html)

(Also if I can specify and fix the capacity (size in memory) of the collection that would also be great)

Jacek Laskowski
  • 72,696
  • 27
  • 242
  • 420
samthebest
  • 30,803
  • 25
  • 102
  • 142
  • I think you'll have to create your own - but it might not be as efficient as you hope if you're going to use, say, Arrays, since you'll need to shuffle down the rest of the elements to close the gap for removed elements? – The Archetypal Paul Dec 29 '14 at 10:55
  • It's also unlikely the housekeeping can be kept in a single stack frame, unless you're happy with an O(N^2) algorithm. In general, keeping track of which elements we've seen already is O(N) and that has to be put somewhere... – The Archetypal Paul Dec 29 '14 at 10:56
  • What properties do you need the collection to have (e.g. O(1) prepend/append, random access...) – The Archetypal Paul Dec 29 '14 at 10:57
  • I just need it to hold uniques so later I can call `size` to see how many uniques there are. I could use a `Set` of some kind, but then the distincting will likely happen frequently (probably on every `+`), which is inefficient, it only needs to happen when the capacity of some backing array is reached. To clarify, I would like the distinct operation to be fast, but the collection itself may only be a few 100 integers, but I'll need literally billions of such collections. – samthebest Dec 29 '14 at 11:08
  • @samthebest "...distincting will likely happen frequently (probably on every +)..." Yes but the use of a hash key to find the elements in a tree which maximum depth is 5 isn't that expensive. – Pablo Francisco Pérez Hidalgo Dec 29 '14 at 11:31
  • @PabloFranciscoPérezHidalgo yes fair point, but like I said on another comment, I'm trying to completely avoid nested structures as these are likely to create more pointers on the heap, which will later result in epic GC hangs. Put it this way, the amount of data I'm trying to process is being generated faster than a python script running on an i7. – samthebest Dec 29 '14 at 15:10
  • @Paul I think an O(N^2) algorithm might not actually be that bad if it's over a mutable array since these arrays should hopefully end up in the L1 cache. – samthebest Dec 29 '14 at 15:21
  • I think you should try a Set and measure. I suspect it won't be that bad. – The Archetypal Paul Dec 29 '14 at 17:00

2 Answers2

2

The algorithm (for C++), you mentioned is for consecutive duplicates. So if you need it for consecutive, you could use some LinkedList, but mutable lists was deprecated. On the other hand if you want something memory-efficient and agree with linear access - you could wrap your collection (mutable or immutable) with distinct iterator (O(N)):

def toConsDist[T](c: Traversable[T]) = new Iterator[T] {
    val i = c.toIterator
    var prev: Option[T] = None
    var _nxt: Option[T] = None
    def nxt = { 
       if (_nxt.isEmpty) _nxt = i.find(x => !prev.toList.contains(x))
       prev = _nxt
       _nxt 
    }        
    def hasNext = nxt.nonEmpty
    def next = {
       val next = nxt.get
       _nxt = None
       next
    }     
 }

 scala> toConsDist(List(1,1,1,2,2,3,3,3,2,2)).toList
 res44: List[Int] = List(1, 2, 3, 2)

If you need to remove all duplicates, it will be О(N*N), but you can't use scala collections for that, because of https://github.com/scala/scala/commit/3cc99d7b4aa43b1b06cc837a55665896993235fc (see LinkedList part), https://stackoverflow.com/a/27645224/1809978.

But you may use Java's LinkedList:

import scala.collection.JavaConverters._

scala> val mlist = new java.util.LinkedList[Integer]
mlist: java.util.LinkedList[Integer] = []

scala> mlist.asScala ++= List(1,1,1,2,2,3,3,3,2,2)
res74: scala.collection.mutable.Buffer[Integer] = Buffer(1, 1, 1, 2, 2, 3, 3, 3, 2, 2)

scala> var i = 0
i: Int = 0

scala> for(x <- mlist.asScala){ if (mlist.indexOf(x) != i) mlist.set(i, null); i+=1} //O(N*N)

scala> while(mlist.remove(null)){} //O(N*N)

scala> mlist
res77: java.util.LinkedList[Integer] = [1, 2, 3]

mlist.asScala just creates wrapper without any copying. You can't modify Java's LinkedList during iteration, that's why i used null's. You may try Java ConcurrentLinkedQueue, but it doesn't support indexOf, so you will have to implement it by yourself (scala maps it to the Iterator, so asScala.indexOf won't work).

Community
  • 1
  • 1
dk14
  • 22,206
  • 4
  • 51
  • 88
1

By definition, immutability forces you to create new objects whenever you want to change your collection.

What Scala provides for some collection are buffers which allow you to build a collection using a mutable interface and finally returning a immutable version but once you got your immutable collection you can't change its references in any way, that includes filtering as distinct. The furthest point you can reach concerning mutability in an immutable collection is changing its elements state when these are mutable objects.

On the other hand, some collections as Vector are implemented as trees (in this case as a trie) and insert or delete operations are implemented not by copying the entire tree but just the required branches.

From Martin Ordesky's Programming in Scala:

Updating an element in the middle of a vector can be done by copying the node that contains the element, and every node that points to it, starting from the root of the tree. This means that a functional update creates between one and five nodes that each contain up to 32 elements or subtrees. This is certainly more expensive than an in-place update in a mutable array, but still a lot cheaper than copying the whole vector.

  • I think `Vector` is a bit heavy weight as it will create a lot of instances of other things under the hood in order to define the trie. Also I'm not sure how you are answering the question since the `distinct` method will create a new instance. – samthebest Dec 29 '14 at 10:41
  • @samthebest `Vector` is an example to remark the fact that for data collections implemented using trees there are optimizations reducing the number of elements created by sharing branches and sub-trees. The best way I know of implementing `distinct` is by the means of a `Set` structure (dumping your data on it and iterating over its elements after that). The default immutable implementation of `Set` in Scala is quite similar to the one I described above for `Vector`. – Pablo Francisco Pérez Hidalgo Dec 29 '14 at 10:49