
This query is a small twist on a commonly asked question.

The goal is to filter a 2D array, removing element pairs whose first-column values are duplicates. For example:

[[1, 1, 1,  2, 2,  3, 3, 3, 3], [0.1, 0.15, 0.2,  0.05, 0.1,  0.2, 0.25, 0.3, 0.35]]
     ->    [[1, 2, 3],[0.2, 0.1, 0.35]]

Since the second-column values vary, some discretion obviously needs to be applied when filtering: here, the last value of the set of duplicates is chosen.
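
For concreteness, here is a minimal imperative sketch of the intended behavior (the function name is mine, purely illustrative):

// Baseline sketch: keep the *last* second-column value for each first-column
// key, preserving the order of first appearance.
func dedupeKeepingLast(_ a: [[Float]]) -> [[Float]] {
    var lastValue: [Float: Float] = [:]
    var order: [Float] = []
    for (key, value) in zip(a[0], a[1]) {
        if lastValue[key] == nil { order.append(key) }
        lastValue[key] = value  // later duplicates overwrite earlier ones
    }
    return [order, order.map { lastValue[$0]! }]
}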

One of the myriad answers to this related question -- a functional programming solution by Tim MB -- can be adapted to the task:

// Use FP-style filtering to eliminate repeated elements
// (.transpose is a custom extension; one implementation appears in the last answer below)
let rawArray: [[Float]] = [...]
let filteredArray = rawArray
    .transpose      // pair up the two rows
    .enumerated()
    .filter { rawArray[0].lastIndex(of: $0.1[0]) == $0.0 }  // keep a pair only at the last index of its first value
    .map { $0.1 }
    .transpose

However, this solution is rather slow, which is unfortunate because it's elegant.

A faster solution that keeps the FP spirit is to use dictionary hashing:

// Use array -> dict -> array trick to remove repeated elements
let rawArray: [[Float]] = [...]
let filteredArray = rawArray
    .transpose
    .reduce(into: [Float: Float]()) { dict, elements in
        dict[elements[0]] = elements[1]  // later duplicates overwrite earlier ones
    }
    .sorted { $0.key < $1.key }  // dictionaries are unordered, so re-sort by key
    .map { [$0.key, $0.value] }
    .transpose

My questions are:

  1. Is this dictionary trick a good idea, given that it uses floats as keys?
  2. Why is the FP solution slow? Can it be sped up?
  3. Are there better alternatives?
Colin Stark
  • "here, the last value of the set of duplicates is chosen." Is this an intentional desired outcome, or a free choice? Being flexible there could drastically change the available approaches. – Alexander Feb 15 '20 at 03:30
  • Yeah, that thread suffers from having waaay too many answers, and it's very hard to separate the wheat from the chaff. – Colin Stark Feb 15 '20 at 03:36
  • "here, the last value of the set of duplicates is chosen." - in this use case, yes, it's required. Other use cases might entail taking the first, or the majority, or the mean value. The dictionary trick (hack) could be made to work for .first, but not the other ops. – Colin Stark Feb 15 '20 at 03:37

3 Answers


Note on terminology: I'll use a to refer to your array, length to refer to its count (a.count), and width to refer to its elements' width (a[0].count).

There are a few things here that are each pretty brutal on your performance.

Transposing

Firstly, each array transposition is O(width * length). Depending on the implementation, it could also be particularly rough on your cache. And you do it twice. Thus, it's an important goal to avoid transposition if possible.

In your case, since you have only two column vectors, you can use zip to iterate them in tandem. The result is a sequence that pairs them lazily, so no copying happens, and no extra memory or time is used.
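
For example, a quick sketch, taking array to be the two-row input from the question:

let array: [[Float]] = [[1, 1, 1, 2, 2, 3, 3, 3, 3],
                        [0.1, 0.15, 0.2, 0.05, 0.1, 0.2, 0.25, 0.3, 0.35]]

// zip pairs the two rows lazily; no transposed copy is ever built.
for (x, y) in zip(array[0], array[1]) {
    print(x, y)  // "1.0 0.1", "1.0 0.15", ..., "3.0 0.35"
}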

Deduplication

The implementation of deduplication that you stumbled on (.filter{ (rawArray[0]).lastIndex(of: $0.1[0]) == $0.0 }) is hot garbage. It's O(width²). It's actually worse than approaches that use Array.contains to maintain an array of "already seen" elements, as sketched below. When contains is looking for an element, it can bail early when it finds a match. lastIndex(of:) always has to go through the entire array, never returning early, because there could always be a later instance of the sought-after element.
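
For comparison, a sketch of that contains-based approach (note it keeps firsts rather than lasts; it's shown only to illustrate the early-bailing lookup):

// Each contains call is O(width) in the worst case, but it can stop at the
// first match; lastIndex(of:) must always scan the entire row.
var seen: [Float] = []
let firstOccurrences = zip(array[0], array[1]).filter { pair in
    guard !seen.contains(pair.0) else { return false }
    seen.append(pair.0)
    return true
}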

Where possible, use an implementation that takes advantage of the Hashable conformance of your elements. Using a Set to track the "already seen" elements allows you to do O(1) contains checks, over Array's O(count). I strongly recommend Cœur's implementation.

There's only one catch: that implementation only keeps the first elements, not the last. Luckily, that's really easy to fix: just reverse the elements, unique them (keeping the firsts of the reversed elements is like keeping the lasts of the original elements), and reverse them back.

My solution:

extension Sequence {
    /// Returns an array containing, in order, the first instances of
    /// elements of the sequence that compare equally for the keyPath.
    func unique<T: Hashable>(for keyPath: KeyPath<Element, T>) -> [Element] {
        var unique = Set<T>()
        return filter { unique.insert($0[keyPath: keyPath]).inserted }
    }
}

let points = zip(array[0], array[1])
let pointsUniquedByXs = points.reversed() // O(count): Zip2Sequence is not a Collection, so this materializes an array
            .unique(for: \.0) // O(count)
            .reversed() // O(1) until you need to materialize it as a reversed collection
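
If the two-row [[Float]] layout is needed at the end, one straightforward way is to unzip the pairs, e.g.:

// Materialize the uniqued pairs back into the original two-row layout.
let uniqued = Array(pointsUniquedByXs)      // [(Float, Float)]
let filteredArray = [uniqued.map { $0.0 },  // [1.0, 2.0, 3.0]
                     uniqued.map { $0.1 }]  // [0.2, 0.1, 0.35]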
Alexander

You can accomplish what you want by first collecting the indices of the first subarray at which an element occurs for the first time when traversed in reverse order. Then you just need to map both subsequences through those indices:

let rawArray: [[Float]] = [[1, 1, 1, 2, 2, 3, 3, 3, 3], [0.1, 0.15, 0.2, 0.05, 0.1, 0.2, 0.25, 0.3, 0.35]]
var set: Set<Float> = []
let indices = rawArray
    .first?
    .indices
    .reversed()
    .filter { set.insert(rawArray.first![$0]).inserted }
    .reversed() ?? []
let result = rawArray.map { elements in indices.map { elements[$0] } }
print(result) //  [[1, 2, 3], [0.2, 0.1, 0.35]]

Another option is to create two empty subsequences, iterate the indices of the first rawArray subsequence in reverse, and try to insert each float value into a set; when the insertion succeeds, append the corresponding elements to the two subsequences. Then you just need to recreate the resulting array from those two new sequences, reversed:

let rawArray: [[Float]] = [[1, 1, 1, 2, 2, 3, 3, 3, 3], [0.1, 0.15, 0.2, 0.05, 0.1, 0.2, 0.25, 0.3, 0.35]]
var set: Set<Float> = []
var sub1: [Float] = []
var sub2: [Float] = []
rawArray[0].indices.reversed().forEach {
    let value = rawArray[0][$0]
    if set.insert(value).inserted {
        sub1.append(value)
        sub2.append(rawArray[1][$0])
    }
}
let result: [[Float]] = [sub1.reversed(), sub2.reversed()] // [[1, 2, 3], [0.2, 0.1, 0.35]]

You can make it even faster if the result is declared as an array of reversed collections of floating-point values: building each subsequence is then O(1) for [ReversedCollection<[Float]>] instead of O(n) for [[Float]].
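
A sketch of that variant, continuing from the snippet above:

// O(1) per row: keep the lazy reversed views instead of building new arrays.
let lazyResult: [ReversedCollection<[Float]>] = [sub1.reversed(), sub2.reversed()]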

Leo Dabus
  • Why is this better compared to the "dictionary trick" mentioned in the question? The most problematic part is the "reversing", which requires all the data to be available in advance. I prefer the mentioned "dictionary trick", which could be continuously evaluated as new data arrives (or is added). – user3441734 Feb 17 '20 at 11:35
  • I didn’t say it is better but if you take a look at the last approach you will notice that I iterate the collection only once. – Leo Dabus Feb 17 '20 at 14:37
  • You would need to benchmark all approaches to see the difference between them – Leo Dabus Feb 17 '20 at 14:39

Thanks to Alexander, here is a solution adapted from Cœur's method in the long related thread.

let rawArray: [[Float]] = [[1, 1, 1,  2, 2,  3, 3, 3, 3],
                           [0.1, 0.15, 0.2,  0.05, 0.1,  0.2, 0.25, 0.3, 0.35]]
let filteredArray = rawArray
    .transpose              // pair up the two rows: [[1, 0.1], [1, 0.15], ...]
    .reversed()             // so that keeping firsts becomes keeping lasts
    .map{ ($0[0], $0[1]) }  // arrays -> tuples, as unique(for:) needs a key path
    .unique(for: \.0)       // drop pairs whose first value has been seen already
    .map{ [$0.0, $0.1] }    // tuples back to arrays
    .reversed()             // restore the original ordering
    .transpose              // back to two rows: [[1, 2, 3], [0.2, 0.1, 0.35]]

All that cruft arises because the data is a two-column float array rather than a 1D array of tuples, and because the last rather than the first duplicate value must be selected.

For this to work, Array needs to have the following extensions, the first courtesy of Alexander and Cœur, the second revised thanks to Leo Dabus:

extension RangeReplaceableCollection {
    /// Returns a collection containing, in order, the first instances of
    /// elements of the sequence that compare equally for the keyPath.
    func unique<T: Hashable>(for keyPath: KeyPath<Element, T>) -> Self {
        var unique = Set<T>()
        return filter { unique.insert($0[keyPath: keyPath]).inserted }
    }
}

extension RandomAccessCollection where Element: RandomAccessCollection {
    /// Perform a transpose operation
    var transpose: [[Element.Element]] {
        guard !isEmpty,
              var index = first?.startIndex,
              let endIndex = first?.endIndex
        else { return [] }
        var result: [[Element.Element]] = []
        while index < endIndex {
            result.append(map { $0[index] })
            first?.formIndex(after: &index)
        }
        return result
    }
}
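
For example:

let m: [[Int]] = [[1, 2, 3], [4, 5, 6]]
print(m.transpose)  // [[1, 4], [2, 5], [3, 6]]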
Colin Stark
  • you can simplify it even further `extension RandomAccessCollection where Element: RandomAccessCollection { var transpose: [[Element.Element]] { guard !isEmpty, var index = first?.startIndex, let endIndex = first?.endIndex else { return [] } var result: [[Element.Element]] = [] while index < endIndex { result.append(map{$0[index]}) first?.formIndex(after: &index) } return result } }` – Leo Dabus Feb 15 '20 at 05:19
  • Thanks Leo: I replaced my transpose operation with yours – Colin Stark Feb 15 '20 at 05:27
  • Another approach `extension Array where Element: RandomAccessCollection { var transpose: [[Element.Element]] { guard allSatisfy({ $0.count == first?.count }) else { return [] } return first?.indices.map { index in map{$0[index]} } ?? [] } }` – Leo Dabus Feb 15 '20 at 05:55