Swift: concatenate two arrays without duplicating shared suffix/prefix

Question

I have a situation where I need the join two arrays which often have an overlapping subarray.

So if I have:

let input1 = ["A", "B", "F", "E"]

and

let input2 = ["F", "E", "A", "G"]

the result should be:

let result = ["A", "B", "F", "E", "A", "G"]

So it's a bit like a "union" in the sense that the output does not duplicate the shared subarray/intersection and contains both arrays in their original state (just overlapping/intersecting). The canonical way to do something similar is with Sets, but that would remove the second "A".

What is the algorithm you're using? This result doesn't look like what you're get from any obvious algorithm or standard meaning of "union." (I'm not saying it's wrong; it's just not obvious what algorithm generates this). — Rob Napier, Apr 15 '22 at 18:35
@RobNapier I had the same thought, but It looks (from his example) like you put them end to end and then snip out the common ones. Pythonically, I'd increment from 1 and compare A[-count:] to B[:count] and increment count if the sub arrays were equal and concatenate them together if not. — Russ Schultz, Apr 15 '22 at 18:38
What about using `OrderedSet` of the Swift-Collection (available with SPM): https://github.com/apple/swift-collections/blob/main/Documentation/OrderedSet.md — Larme, Apr 15 '22 at 19:33
Thanks for the thought. The problem is that this is still a Set and thus only allows one instance of each element. From the question you can see that "A" appears twice in the result. — jbm, Apr 15 '22 at 19:50
I misread the output, indeed, there is at the end of the array A, and start of array B, the same "sub array", that need to not be repeated. My bad. And there can't be repetition of an element in array A, and one in array B separately? Like having `["A", "A"]` for the first array? — Larme, Apr 15 '22 at 19:54
@Larme I don't think that would make a difference. ["A", "B", "F", "F"] and ["F", "F", "A", "G"] should result in ["A", "B", "F", "F", "F", "A", "G"] because they match the first and last element. — Leo Dabus, Apr 15 '22 at 20:19
@Larme, as @Leo Dabus suggests, the two arrays can contain whatever elements—i.e., it could be `["A", "A", "A", "B"] + ["A", "B", "B", "B"] = ["A", "A", "A", "B", "B", "B"]`. — jbm, Apr 15 '22 at 20:27

Joakim Danielson · Answer 1 · 2022-04-16T07:31:26.133

My previous solution didn't handled duplicates correctly so to fix this I added this function to Array to count the number of trailing duplicates

extension Array where Element: Equatable {
    func trailingDuplicates() -> Int {
        guard let last = self.last else { return 0 }
        let reversed = self.reversed().dropFirst()
        for (offset, element) in reversed.enumerated() {
            if last != element { return offset }
        }
        return 0
    }
}

And then I adjusted the result of calling lastIndex(of:) with the result of this new function in my old solution

func concatenateWithoutDuplicates<Value: Equatable>(_ first: [Value], with second: [Value]) -> [Value] {
    guard var firstIndex = first.lastIndex(of: second[0]) else {
        return first + second
    }

    firstIndex -= first.trailingDuplicates()
    let secondIndex = second.index(second.endIndex, offsetBy: -first[firstIndex..<first.endIndex].count)
    if first[firstIndex..<first.endIndex] == second[second.startIndex..<secondIndex] {
        return first + second[secondIndex..<second.endIndex]
    } else {
        return first + second
    }
}

Running some test examples from the question and comments

let testData = [
    (["A", "B", "F", "E"], ["F", "E", "A", "G"]),
    (["A", "B", "F", "F"], ["F", "F", "A", "G"]),
    (["A", "A", "A", "B"], ["A", "B", "B", "B"])
]

for datum in testData {
    print("\(datum.0), \(datum.1) -> \(concatenateWithoutDuplicates(datum.0, with: datum.1))")
}

["A", "B", "F", "E"], ["F", "E", "A", "G"] -> ["A", "B", "F", "E", "A", "G"]
["A", "B", "F", "F"], ["F", "F", "A", "G"] -> ["A", "B", "F", "F", "A", "G"]
["A", "A", "A", "B"], ["A", "B", "B", "B"] -> ["A", "A", "A", "B", "B", "B"]

Here is a brute force solution where I start by finding the last position in the first array of the first element of the second array and then directly compare the end of the first array with the start of the second array using the found index to create sub arrays.

let first = ["A", "B", "F", "E"]
let second = ["F", "E", "A", "G"]

let result: [String]
if let firstIndex = first.lastIndex(of: second[0]) {
    let secondIndex = second.index(second.endIndex, offsetBy: -first[firstIndex..<first.endIndex].count)

    if first[firstIndex..<first.endIndex] == second[second.startIndex..<secondIndex] {
        result = first + second[secondIndex..<second.endIndex]
    } else {
        result = first + second
    }
} else {
    result = first + second
}

Note that I haven't tested for any edge cases here, just the sample given in the question with some simple variants.

Thanks @Joakim Danielson, I'm going with this answer. I tried a few different cases of inputs and it seems to work. For more context, I'm receiving predictions from a GPT-2 (generative language model) which always appends continuations on the existing context. So the frontend has to make sure not to keep inserting the duplicated context on each call, but rather to just append the new prediction. I could do this on the backend (PyTorch) but for now I'm doing it on the frontend. Maybe I'll get downvoted for that, too. Haha... — jbm, Apr 15 '22 at 19:55
PS - Because you gave a good answer I didn't delete the question. It's clearly a valid problem as evidenced by the fact that you came up with a correct solution without having to ask any further questions. — jbm, Apr 15 '22 at 20:03
Note that this would result in `["A", "B", "F", "F", "F", "F", "A", "G"]` for `["A", "B", "F", "F"]` and `["F", "F", "A", "G"]` when IMO it should match a single element subsequence and result in `["A", "B", "F", "F", "F", "A", "G"]`. Btw not my downvote. — Leo Dabus, Apr 15 '22 at 20:21
Well, the shared bit here is just `["F", "F"]`, so the result should be `["A", "B", "F", "F", "A", "G"]`. — jbm, Apr 15 '22 at 20:33
so you need to iterate two indices at a time. You should add this to the question. Btw why are you using an array of characters instead of a string? — Leo Dabus, Apr 15 '22 at 20:35
Sorry, I often use "bit" instead of "part" (my partner's a Brit)... In fact, the final application is using strings, so it could be `"A B F F" and "F F A G"`. I was just wondering whether `hasSuffix`/`hasPrefix` might be better. — jbm, Apr 15 '22 at 20:50
@jbm you can use hasSuffix instead of starts with for strings but I don't see any advantage. — Leo Dabus, Apr 15 '22 at 21:07
@JoakimDanielson It was Larme that started the discussion about it. — Leo Dabus, Apr 15 '22 at 21:08

Leo Dabus · Accepted Answer · 2022-04-16T16:23:25.313

Just for fun you can use starts with predicate while iterating your first sequence from the end as follow:

let first: [String] = ["A", "B", "F", "E"]
let second: [String] = ["F", "E", "A", "G"]

var pos = first.endIndex
while pos > first.startIndex,
      second.starts(with: first[pos...], by: { $0 != $1}),
      !second.isEmpty {
    first.formIndex(before: &pos)
}

let result = first[..<pos] + second // ["A", "B", "F", "E", "A", "G"]

This will result in a SubSequence, in this case an array slice. If you need an array just explicitly set the resulting type:

let result: [String] = first[..<pos] + second

Based on OP comments if you need to match the subsequence by pairs just offset every two elements:

let first = "ABFF"
let second = "FFAG"

var pos = first.endIndex
while pos > first.startIndex,
      second.starts(with: first[pos...], by: { $0 != $1 }),
      !second.isEmpty {
    first.formIndex(&pos, offsetBy: -2)
}

let result: String = first[..<pos] + second  // "ABFFAG"

If you need the string elements separated by spaces:

var first = "A B C D E F G D E"
var second = "D E F C B A"

first.removeAll(where: \.isWhitespace)
second.removeAll(where: \.isWhitespace)

var pos = first.endIndex
while pos > first.startIndex,
      second.starts(with: first[pos...], by: { $0 != $1 }),
      !second.isEmpty {
    first.formIndex(&pos, offsetBy: -2)
}

let result = (first[..<pos] + second)
                 .map(String.init)
                 .joined(separator: " ")
result  // "A B C D E F G D E F C B A"

edit/update:

Following the logic shown at your last comment/answer you can do something like:

extension RangeReplaceableCollection where Element: Equatable {
    mutating func appendAndMerge<C: Collection>(with collection: C) where C.Element == Element {
        var lowerBound = startIndex
        formIndex(&lowerBound, offsetBy: Swift.min(count, count-collection.count), limitedBy: endIndex)
        while !collection.starts(with: self[lowerBound...]) {
            formIndex(&lowerBound, offsetBy: 1, limitedBy: endIndex)
        }
        replaceSubrange(lowerBound..., with: collection)
    }
}

Usage:

var first = ["at", "by", "chicken", "dog", "eat", "for", "good", "dog", "eat"]
let second =  ["good", "dog", "eat", "feed", "cats", "bonk", "atrophe"]
first.appendAndMerge(with: second)
print(first)

This will print

["at", "by", "chicken", "dog", "eat", "for", "good", "dog", "eat", "feed", "cats", "bonk", "atrophe"]

Using strings (collection of characters)

var first = "at by chicken dog eat for good dog eat"
let second =  "good dog eat feed cats bonk atrophe"
first.appendAndMerge(with: second)
print(first)

This will print:

"at by chicken dog eat for good dog eat feed cats bonk atrophe"

Interesting. I may try benchmarking these with some long arrays, just out of curiosity. Thanks! — jbm, Apr 15 '22 at 20:04
just noting that your last example is still tailored to strings. — jbm, Apr 15 '22 at 21:36
@jbm Not sure what you mean. You can use it with any type of collection as you can see in my strings example. The logic is exactly the same as well as explicitly setting the non SubSequence types. — Leo Dabus, Apr 15 '22 at 21:37
Hmm... Okay, probably because I was running in Playgrounds (which is handy, but also seems weirdly limited). — jbm, Apr 15 '22 at 21:38
I've gone with this as it's elegant and uses only native Swift. — jbm, Apr 16 '22 at 14:34
Yeah, this is great, thanks! My application is language (actually, music) generation with Transformers where the default backend behaviour is to append the prediction to the input context. In my client I'm keeping a long buffer of text and sending the last N tokens for prediction. The model will send back the context AND the prediction but I only want to append the prediction to my buffer (the buffer has the context). This would apply to handling responses from any predictive model of this type. — jbm, Apr 16 '22 at 16:04

score 0 · Answer 3 · answered Apr 16 '22 at 10:42

There are probably other answers here that are more efficient for Arrays. I was interested in a solution for all Sequences.

XCTAssertEqual(
  Array(
    chainWithoutOverlap(
      stride(from: 1, to: 10, by: 2),
      (5...15).filter { !$0.isMultiple(of: 2) }
    )
  ),
  .init(
    stride(from: 1, through: 15, by: 2)
  )
)

import Algorithms

@inlinable public func chainWithoutOverlap<Sequence1: Sequence, Sequence2: Sequence>(
  _ sequence1: Sequence1, _ sequence2: Sequence2
) -> Chain2Sequence<
  LazyMapSequence<
    LazyPrefixWhileSequence<
      UnfoldSequence<(Sequence1.Element, Sequence1.Iterator), Sequence1.Iterator>
    >,
    Sequence1.Element
  >,
  Sequence2
> where Sequence1.Element: Equatable, Sequence1.Element == Sequence2.Element {
  chain(
    sequence1.withDropIterators.lazy
      .prefix { !sequence2.starts(with: IteratorSequence($1)) }
      .map(\.0),
    sequence2
  )
}

public extension Sequence {
  /// Like `zip`ping with the iterators of all subsequences, incrementally dropping early elements.
  /// - Note: Begins with the iterator for the full sequence (dropping zero).
  @inlinable var withDropIterators: UnfoldSequence<(Element, Iterator), Iterator> {
    sequence(state: makeIterator()) {
      let iterator = $0
      return $0.next().map { ($0, iterator) }
    }
  }
}

score -1 · Answer 4 · edited Apr 16 '22 at 09:35

-1

This seems to work for what I need. The one thing I don't like is the check on (tail.count > second.count), which feels like a hack...

var first: [String] = ["at", "by", "chicken", "dog", "eat", "for", "good", "dog", "eat"]
var second: [String] = ["good", "dog", "eat", "feed", "cats", "bonk", "atrophe"]

var head: ArraySlice<String> = []
var tail: ArraySlice<String> = []
for i in stride(from: first.count-1, through: 0, by: -1) {
    tail = first[i...]
    if (tail.count > second.count) || second.prefix(tail.count) != tail {
        tail = []
    } else {
        head = first.prefix(i)
    }
}

let result = head + second
print(result) // ["at", "by", "chicken", "dog", "eat", "for", "good", "dog", "eat", "feed", "cats", "bonk", "atrophe"]

I also made this an extension... might be simpler in context (e.g., could handle token ids instead of just tokens):

extension Array where Element : Equatable {
    /**
     Extend the existing array with an array without repeating overlapping elements.
     */
    mutating func extend(withArray array: [Element]) {
        var head: [Element] = []
        var tail: [Element] = []
        for i in stride(from: self.count-1, through: 0, by: -1) {
            tail = Array(self[i...])
            if (tail.count > array.count) || Array(array[0 ..< tail.count]) != tail {
                tail = []
            } else {
                head = Array(self[..<i])
            }
        }
        if head.count > 0 {
            self = head + array
        } else {
            self += array
        }
    }
}

edited Apr 16 '22 at 09:35

E_net4

27,810
13
101
139

answered Apr 15 '22 at 21:37

jbm

1,248
10
22

`String` has no `substring(fromIndex:)` method. Btw String is not indexed by integers. I wonder why don't you use Swift native methods for this. Just remove the whitespaces before executing the logic – Leo Dabus Apr 15 '22 at 21:42
Ah, quite right! Sorry, my code already had the extension from here: https://stackoverflow.com/questions/24092884/get-nth-character-of-a-string-in-swift-programming-language. Re: removing white spaces; I'd have to replace them since this will be applied to actual text (hence why arrays could be the most generic solution). – jbm Apr 15 '22 at 21:57
1

I wouldn't use a method called substring that returns a string. Btw Index is also very misleading considering you need to pass an Int. Anyway it is up to you adding those extensions to your project. Just to let you know it wouldn't pass any decent review. – Leo Dabus Apr 15 '22 at 22:00
It will work for now. Honestly, I'm a founder and really only doing this for a prototype. Someone with much more skill as a coder will/should ultimately address the issues you're raising. But I'll poke away at an array-based solution as well. Shouldn't be hard now that I've done it this way. – jbm Apr 15 '22 at 22:12
Btw not my downvote but I definitely expected that to happen – Leo Dabus Apr 15 '22 at 22:14
Not sure also how this would work for the "bits" (pairs) you have mentioned before. "AAAA" and "AAAA" would result in "AAAA". Anyway good luck. – Leo Dabus Apr 15 '22 at 22:20
I answered your question about "bits" above... I meant "parts"... sorry for the confusion. – jbm Apr 15 '22 at 22:25
1

You can improve your code performance by declaring head and tail as ArraySlice. This would avoid duplicating the array elements in memory. You can also improve your syntax using prefix(n). `tail = first[i...]` `if tail.count > second.count || second.prefix(tail.count) != tail {` `tail = []` `} else {` `head = first.prefix(i)` `}`. Using the same technique I showed in my post you can make the result an array of strings explicitly setting the result type. `let result: [String] = head + second` – Leo Dabus Apr 15 '22 at 23:05

Swift: concatenate two arrays without duplicating shared suffix/prefix

4 Answers4