8

I need a function which takes a list and return unique element if it exists or [] if it doesn't. If many unique elements exists it should return the first one (without wasting time to find others). Additionally I know that all elements in the list come from (small and known) set A. For example this function does the job for Ints:

unique :: Ord a => [a] -> [a]
unique li = first $ filter ((==1).length) ((group.sort) li)
    where first [] = []
          first (x:xs) = x

ghci> unique [3,5,6,8,3,9,3,5,6,9,3,5,6,9,1,5,6,8,9,5,6,8,9]
ghci> [1]

This is however not good enough because it involves sorting (n log n) while it could be done in linear time (because A is small). Additionally it requires the type of list elements to be Ord while all which should be needed is Eq. It would also be nice if amount of comparisons was as small as possible (ie if we traverse a list and encounter element el twice we don't test subsequent elements for equality with el)

This is why for example this: Counting unique elements in a list doesn't solve the problem - all answers involve either sorting or traversing the whole list to find count of all elements.

The question is: how to do it correctly and efficiently in Haskell ?

Community
  • 1
  • 1
Piotr Lopusiewicz
  • 2,514
  • 2
  • 27
  • 38

6 Answers6

12

Okay, linear time, from a finite domain. The running time will be O((m + d) log d), where m is the size of the list and d is the size of the domain, which is linear when d is fixed. My plan is to use the elements of the set as the keys of a trie, with the counts as values, then look through the trie for elements with count 1.

import qualified Data.IntTrie as IntTrie
import Data.List (foldl')
import Control.Applicative

Count each of the elements. This traverses the list once, builds a trie with the results (O(m log d)), then returns a function which looks up the result in the trie (with running time O(log d)).

counts :: (Enum a) => [a] -> (a -> Int)
counts xs = IntTrie.apply (foldl' insert (pure 0) xs) . fromEnum
    where
    insert t x = IntTrie.modify' (fromEnum x) (+1) t

We use the Enum constraint to convert values of type a to integers in order to index them in the trie. An Enum instance is part of the witness of your assumption that a is a small, finite set (Bounded would be the other part, but see below).

And then look for ones that are unique.

uniques :: (Eq a, Enum a) => [a] -> [a] -> [a]
uniques dom xs = filter (\x -> cts x == 1) dom
    where
    cts = counts xs

This function takes as its first parameter an enumeration of the entire domain. We could have required a Bounded a constraint and used [minBound..maxBound] instead, which is semantically appealing to me since finite is essentially Enum+Bounded, but quite inflexible since now the domain needs to be known at compile time. So I would choose this slightly uglier but more flexible variant.

uniques traverses the domain once (lazily, so head . uniques dom will only traverse as far as it needs to to find the first unique element -- not in the list, but in dom), for each element running the lookup function which we have established is O(log d), so the filter takes O(d log d), and building the table of counts takes O(m log d). So uniques runs in O((m + d) log d), which is linear when d is fixed. It will take at least Ω(m log d) to get any information from it, because it has to traverse the whole list to build the table (you have to get all the way to the end of the list to see if an element was repeated, so you can't do better than this).

luqui
  • 59,485
  • 12
  • 145
  • 204
  • Wish I could upvote more than once. It seems everyone else was ignoring the "finite, small domain" qualification. – John L Apr 18 '13 at 09:09
  • To be fair, the OQ specified that the elements were not ordered, and could only be compared. That said, adding an `Enum` instance is clearly the right thing to do, as it's hard to imagine a "finite, small domain" that wouldn't qualify, with O(1) `fromEnum`... – comingstorm Apr 18 '13 at 18:07
  • While I was interested in general case (of not ordered elements) the specific problem I had could easily use ordered type. I've learnt a lot from this answer. – Piotr Lopusiewicz Apr 18 '13 at 23:26
  • It however seems to me that it traverses the whole list even if it don't have to (in case all elements of A appears twice already). Is that correct ? – Piotr Lopusiewicz Apr 18 '13 at 23:33
  • 1
    @PiotrLopusiewicz ah, yes, it does. This solution can be altered to allow this though -- instead of using Ints as the values of the tree, use the values of a special monoid that tracks the problem structure exactly: `1 <> 1 = 2`, `2 <> 1 = 2`, etc. You also have to change the strict `foldl` in counts to a lazy `foldr` to allow the traversal to act lazily. I can post more details if you like. – luqui Apr 19 '13 at 14:43
6

There really isn't any way to do this efficiently with just Eq. You'd need to use some much less efficient way to build the groups of equal elements, and you can't know that only one of a particular element exists without scanning the whole list.

Also, note that to avoid useless comparisons you'd need a way of checking to see if an element has been encountered before, and the only way to do that would be to have a list of elements known to have multiple occurrences, and the only way to check if the current element is in that list is... to compare it for equality with each.

If you want this to work faster than O(something really horrible) you need that Ord constraint.


Ok, based on the clarifications in comments, here's a quick and dirty example of what I think you're looking for:

unique [] _ _ = Nothing
unique _ [] [] = Nothing
unique _ (r:_) [] = Just r
unique candidates results (x:xs)
    | x `notElem` candidates = unique candidates results xs
    | x `elem` results       = unique (delete x candidates) (delete x results) xs
    | otherwise              = unique candidates (x:results) xs

The first argument is a list of candidates, which should initially be all possible elements. The second argument is the list of possible results, which should initially be empty. The third argument is the list to examine.

If it runs out of candidates, or reaches the end of the list with no results, it returns Nothing. If it reaches the end of the list with results, it returns the one at the front of the result list.

Otherwise, it examines the next input element: If it's not a candidate, it ignores it and continues. If it's in the result list we've seen it twice, so remove it from the result and candidate lists and continue. Otherwise, add it to the results and continue.

Unfortunately, this still has to scan the entire list for even a single result, since that's the only way to be sure it's actually unique.

C. A. McCann
  • 76,893
  • 19
  • 209
  • 302
  • "and the only way to do that would be to have a list of elements known to have multiple occurrences" - you can "keep" a list of elements you encountered as you go (starting with list of all elements in A). If this list becomes empty at any point then terminate; I mean there is a way to do it in imperative programming so I hope there is some way to do it in Haskell as well. – Piotr Lopusiewicz Apr 17 '13 at 22:55
  • @PiotrLopusiewicz: There's no good way to do this in an imperative language either, under the constraints you're imposing. – C. A. McCann Apr 17 '13 at 22:58
  • I can do this: I start with set C of all elements from A (remember - A is small). I traverse a list element by element comparing every one of them with every element from C. If some element is encountered 2nd time I drop it from C. If C is still not empty I continue. If it becomes empty at any point I terminate. This takes numelA * n operations in pessimistic case which is linear. – Piotr Lopusiewicz Apr 17 '13 at 23:00
  • @PiotrLopusiewicz: Your constraints in the question don't let you have a "list of all elements from A". If you want to allow that, then the algorithm you suggest translates directly to Haskell. Having a specific type you're working with is very different from a polymorphic function. – C. A. McCann Apr 17 '13 at 23:01
  • "Your constraints in the question don't let you have a "list of all elements from A" - maybe my English is not good enough. I meant that "small set A" is known. I still have problems translating it to haskell because it requires keeping some state as I go from element to element which is not easy (for me). – Piotr Lopusiewicz Apr 17 '13 at 23:03
  • @PiotrLopusiewicz: Just walk down the list recursively the same way you'd loop through a collection imperatively, and add extra parameters to the recursive function for each piece of extra state. – C. A. McCann Apr 17 '13 at 23:09
2

First off, if your function is intended to return at most one element, you should almost certainly use Maybe a instead of [a] to return your result.

Second, at minimum, you have no choice but to traverse the entire list: you can't tell for sure if any given element is actually unique until you've looked at all the others.

If your elements are not Ordered, but can only be tested for Equality, you really have no better option than something like:

firstUnique (x:xs)
  | elem x xs = firstUnique (filter (/= x) xs)
  | otherwise = Just x
firstUnique [] = Nothing

Note that you don't need to filter out the duplicated elements if you don't want to -- the worst case is quadratic either way.


Edit:

The above misses the possibility of early exit due to the above-mentioned small/known set of possible elements. However, note that the worst case will still require traversing the entire list: all that is necessary is for at least one of these possible elements to be missing from the list...

However, an implementation that provides an early out in case of set exhaustion:

firstUnique = f [] [<small/known set of possible elements>] where
  f [] [] _ = Nothing  -- early out
  f uniques noshows (x:xs)
    | elem x uniques = f (delete x uniques) noshows xs
    | elem x noshows = f (x:uniques) (delete x noshows) xs
    | otherwise      = f uniques noshows xs
  f []    _ [] = Nothing
  f (u:_) _ [] = Just u

Note that if your list has elements which shouldn't be there (because they aren't in the small/known set), they will be pointedly ignored by the above code...

comingstorm
  • 25,557
  • 3
  • 43
  • 67
  • Thanks to laziness and Piotr's guarantee that the set of elements is small, the average case and best case will be a lot better than N log N even though the worst case is worse, because the average and best case when sorting is still N log N. – Daniel Lyons Apr 17 '13 at 23:01
  • "Second, at minimum, you have no choice but to traverse the entire list: you can't tell for sure if any given element is actually unique until you've looked at all the others." - this is not true - it might become apparent during the traversing that no unique element exists. I accept your point about maybe type - thanks. – Piotr Lopusiewicz Apr 17 '13 at 23:07
  • Sorry, I missed the bit about the small/fixed set. Let me think for a bit, and maybe come up with an edit... – comingstorm Apr 17 '13 at 23:12
  • I accepted different answer but this is very elegant and actually the fastest in my problem (the first one that is). – Piotr Lopusiewicz Apr 22 '13 at 03:34
2

As others have said, without any additional constraints, you can't do this in less than quadratic time, because without knowing something about the elements, you can't keep them in some reasonable data structure.

If we are able to compare elements, an obvious O(n log n) solution to compute the count of elements first and then find the first one with count equal to 1:

import Data.List (foldl', find)
import Data.Map (Map)
import qualified Data.Map as Map
import Data.Maybe (fromMaybe)

count :: (Ord a) => Map a Int -> a -> Int
count m x = fromMaybe 0 $ Map.lookup x m

add :: (Ord a) => Map a Int -> a -> Map a Int
add m x = Map.insertWith (+) x 1 m

uniq :: (Ord a) => [a] -> Maybe a
uniq xs = find (\x -> count cs x == 1) xs
  where
    cs = foldl' add Map.empty xs

Note that the log n factor comes from the fact that we need to operate on a Map of size n. If the list has only k unique elements then the size of our map will be at most k, so the overall complexity will be just O(n log k).

However, we can do even better - we can use a hash table instead of a map to get an O(n) solution. For this we'll need the ST monad to perform mutable operations on the hash map, and our elements will have to be Hashable. The solution is basically the same as before, just a little bit more complex due to working within the ST monad:

import Control.Monad
import Control.Monad.ST
import Data.Hashable
import qualified Data.HashTable.ST.Basic as HT
import Data.Maybe (fromMaybe)

count :: (Eq a, Hashable a) => HT.HashTable s a Int -> a -> ST s Int
count ht x = liftM (fromMaybe 0) (HT.lookup ht x)

add :: (Eq a, Hashable a) => HT.HashTable s a Int -> a -> ST s ()
add ht x = count ht x >>= HT.insert ht x . (+ 1)

uniq :: (Eq a, Hashable a) => [a] -> Maybe a
uniq xs = runST $ do
    -- Count all elements into a hash table:
    ht <- HT.newSized (length xs)
    forM_ xs (add ht)
    -- Find the first one with count 1
    first (\x -> liftM (== 1) (count ht x)) xs


-- Monadic variant of find which exists once an element is found.
first :: (Monad m) => (a -> m Bool) -> [a] -> m (Maybe a)
first p = f
  where
    f []        = return Nothing
    f (x:xs')   = do
        b <- p x
        if b then return (Just x)
             else f xs'

Notes:

  • If you know that there will be only a small number of distinct elements in the list, you could use HT.new instead of HT.newSized (length xs). This will save you some memory and one pass over xs but in the case of many distinct elements the hash table will be have to resized several times.
Petr
  • 62,528
  • 13
  • 153
  • 317
  • 1
    Since the OP knows the domain of the elements, a perfect hash might even be applicable here. Which would make a hash table a pretty good solution. Or, taking a hint from luqui, if you can rely upon `Enum,Bounded`, a mutable array/vector would likely be even better. – John L Apr 18 '13 at 09:13
1

Here is a version that does the trick:

unique :: Eq a => [a] -> [a]
unique =  select . collect []
  where
    collect acc []              = acc
    collect acc (x : xs)        = collect (insert x acc) xs

    insert x []                 = [[x]]
    insert x (ys@(y : _) : yss) 
      | x == y                  = (x : ys) : yss
      | otherwise               = ys : insert x yss

    select []                   = []
    select ([x] : _)            = [x]
    select ((_ : _) : xss)      = select xss

So, first we traverse the input list (collect) while maintaining a list of buckets of equal elements that we update with insert. Then we simply select the first element that appears in a singleton bucket (select).

The bad news is that this takes quadratic time: for every visited element in collect we need to go over the list of buckets. I am afraid that is the price you will have to pay for only being able to constrain the element type to be in Eq.

Stefan Holdermans
  • 7,990
  • 1
  • 26
  • 31
  • This is ok. The set is small so it's only quadratic over size of this set and not input size. It seems to me though that it keep comparing to all elements during the traverse but I still have problems imagining how Haskell code work exactly. – Piotr Lopusiewicz Apr 18 '13 at 23:28
0

Something like this look pretty good.

unique = fst . foldl' (\(a, b) c -> if (c `elem` b) 
                                    then (a, b) 
                                    else if (c `elem` a) 
                                         then (delete c a, c:b) 
                                         else (c:a, b)) ([],[]) 

The first element of the resulted tuple of the fold, contain what you are expecting, a list containing unique element. The second element of the tuple is the memory of the process remembered if an element has already been discarded or not.

About space performance.
As your problem is design, all the element of the list should be traversed at least one time, before a result can be display. And the internal algorithm must keep trace of discarded value in addition to the good one, but discarded value will appears only one time. Then in the worst case the required amount of memory is equal to the size of the inputted list. This sound goods as you said that expected input are small.

About time performance.
As the expected input are small and not sorted by default, trying to sort the list into the algorithm is useless, or before to apply it is useless. In fact statically we can almost said, that the extra operation to place an element at its ordered place (into the sub list a and b of the tuple (a,b)) will cost the same amount of time than to check if this element appear into the list or not.


Below a nicer and more explicit version of the foldl' one.

import Data.List (foldl', delete, elem)

unique :: Eq a => [a] -> [a]
unique = fst . foldl' algorithm ([], []) 
  where 
    algorithm (result0, memory0) current = 
         if (current `elem` memory0)
         then (result0, memory0)
         else if (current`elem` result0)
              then (delete current result0, memory) 
              else (result, memory0) 
            where
                result = current : result0
                memory = current : memory0

Into the nested if ... then ... else ... instruction the list result is traversed twice in the worst case, this can be avoid using the following helper function.

unique' :: Eq a => [a] -> [a]
unique' = fst . foldl' algorithm ([], []) 
  where 
    algorithm (result, memory) current = 
         if (current `elem` memory)
         then (result, memory)
         else helper current result memory []
            where
               helper current [] [] acc = ([current], [])
               helper current [] memory acc = (acc, memory)
               helper current (r:rs) memory acc 
                   | current == r    = (acc ++ rs, current:memory) 
                   | otherwise = helper current rs memory (r:acc)

But the helper can be rewrite using fold as follow, which is definitely nicer.

helper current [] _ = ([current],[])
helper current memory result = 
    foldl' (\(r, m) x -> if x==current 
                         then (r, current:m) 
                         else (current:r, m)) ([], memory) $ result
zurgl
  • 1,930
  • 1
  • 14
  • 20