4

Given a set of 50k strings, I need to find all pairs (s, t) such that s, t, and s + t are all contained in this set.

What I've tried

Currently, there's an additional constraint: s.length() >= 4 && t.length() >= 4. This makes it possible to group the strings by their length-4 prefixes and, separately, by their length-4 suffixes. Then, for every string composed of length at least 8, I look up the set of candidates for s using the first four characters of composed and the set of candidates for t using its last four characters. This works, but it needs to look at 30M candidate pairs (s, t) to find the 7k results.

This surprisingly high number of candidates comes from the fact that the strings are (mostly German) words from a limited vocabulary, and words often start and end the same way. It's still much better than trying all 2.5G pairs, but much worse than I hoped.
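
For illustration, the grouping approach looks roughly like this (a reconstruction of the idea, not my actual code; strings stands for the full set):

import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Group candidate values for s by their first four characters and
// candidate values for t by their last four characters.
Map<String, List<String>> byPrefix = new HashMap<>();
Map<String, List<String>> bySuffix = new HashMap<>();
for (String s : strings) {
    if (s.length() >= 4) {
        byPrefix.computeIfAbsent(s.substring(0, 4), k -> new ArrayList<>()).add(s);
        bySuffix.computeIfAbsent(s.substring(s.length() - 4), k -> new ArrayList<>()).add(s);
    }
}

// For every potential composed string, try all (s, t) candidate pairs.
for (String composed : strings) {
    if (composed.length() < 8) continue;
    List<String> sCandidates =
        byPrefix.getOrDefault(composed.substring(0, 4), Collections.emptyList());
    List<String> tCandidates =
        bySuffix.getOrDefault(composed.substring(composed.length() - 4), Collections.emptyList());
    for (String s : sCandidates) {
        for (String t : tCandidates) {
            if (s.length() + t.length() == composed.length() && (s + t).equals(composed)) {
                // found a pair (s, t) with s + t == composed
            }
        }
    }
}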

What I need

As the additional constraint may get dropped and the set will grow, I'm looking for a better algorithm.

The "missing" question

There were complaints about me not asking a question. So the missing question mark is at the end of the next sentence. How can this be done more efficiently, ideally without using the constraint?

maaartinus
  • Actually you do not ask a question – Scary Wombat Sep 10 '18 at 05:02
  • @ScaryWombat Actually, by stating that I need an algorithm, I am asking a question implicitly. I had hoped that it's rather obvious, but there seem to be question-mark-searching bots here. – maaartinus Sep 10 '18 at 05:14
  • You wrote *I'm looking for a better algorithm* - I wish you luck – Scary Wombat Sep 10 '18 at 05:17
  • @ScaryWombat My last comment wasn't directed against you. On the contrary, I appreciate you answering my comment. Anyway, do you think it's unclear? – maaartinus Sep 10 '18 at 05:19
  • No problem, sorry I can not help – Scary Wombat Sep 10 '18 at 05:20
  • Have you considered something like a list sorted first by length, then by alphabetical order? – Mad Physicist Sep 10 '18 at 06:12
  • @MadPhysicist This would be good for finding all prefixes... actually, finding all prefixes is the way to go. You've just brought me to the idea of the answer by ErikE. Strange that I didn't come to it myself.... I used a similar idea recently. – maaartinus Sep 10 '18 at 06:33

4 Answers

5

Algorithm 1: Test pairs, not singles

One way could be, instead of working from all possible pairs to all possible composite strings containing those pairs, to work from all possible composite strings and see if they contain pairs. This changes the problem from n^2 lookups (where n is the number of strings >= 4 characters) to m * n lookups (where m is the average length of all strings >= 8 characters, minus 7, and n is now the number of strings >= 8 characters). Here's one implementation of that:

import java.util.Set;
import java.util.stream.IntStream;
import java.util.stream.Stream;

import com.google.common.collect.ImmutableList;
import com.google.common.collect.ImmutableSet;

int minWordLength = 4;
int minPairLength = 8;

// All words of at least minWordLength characters, in a hash-based set,
// so membership tests are O(1) on average.
Set<String> strings = Stream
   .of(
      "a", "abc", "abcdef", "def", "sun", "sunshine", "shine",
      "bear", "hug", "bearhug", "cur", "curlique", "curl",
      "down", "downstream", "stream"
   )
   .filter(s -> s.length() >= minWordLength)
   .collect(ImmutableSet.toImmutableSet());

strings
   .stream()
   // only strings of at least minPairLength characters can be composites
   .filter(s -> s.length() >= minPairLength)
   .flatMap(s -> IntStream
      // every split point that leaves at least minWordLength on each side
      .rangeClosed(minWordLength, s.length() - minWordLength)
      .mapToObj(splitIndex -> ImmutableList.of(
         s.substring(0, splitIndex),
         s.substring(splitIndex)
      ))
      // keep only splits where both halves are themselves in the set
      .filter(pair ->
          strings.contains(pair.get(0))
          && strings.contains(pair.get(1))
      )
   )
   .map(pair ->
      pair.get(0) + pair.get(1) + " = " + pair.get(0) + " + " + pair.get(1)
   )
   .forEach(System.out::println);

Gives the result:

downstream = down + stream

This has an average algorithmic complexity of m * n, as shown above. Since m is in practice a small constant, that is in effect O(n); in the worst case, O(n^2). See hash table for more on the algorithmic complexity of the lookups.

Explanation

  1. Put all strings four or more characters long into a hash set (which takes average O(1) complexity for search). I used Guava's ImmutableSet for convenience. Use whatever you like.
  2. filter: Restrict to only the items that are eight or more characters in length, representing our candidates for being a composition of two other words in the list.
  3. flatMap: For each candidate, compute all possible pairs of sub-words, ensuring each is at least 4 characters long. Since there can be more than one result, this is in effect a list of lists, so flatten it into a single-deep list.
    1. rangeClosed: Generate all integers representing the number of characters that will be in the first word of the pair we will check.
    2. mapToObj: Use each integer combined with our candidate string to output a list of two items (in production code you'd probably want something clearer, like a two-property value class or an appropriate existing class; see the sketch after this list).
    3. filter: Restrict to only pairs where both are in the list.
  4. map: Pretty up the results a little.
  5. forEach: Output to the console.
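
As mentioned in step 3.2, production code would rather use a small value class than a two-element list. A minimal sketch of what that could look like (the name Pair and its toString format are just illustrative):

// Illustrative two-property value class to replace ImmutableList.of(s, t).
final class Pair {
    final String left;
    final String right;

    Pair(String left, String right) {
        this.left = left;
        this.right = right;
    }

    @Override
    public String toString() {
        return left + right + " = " + left + " + " + right;
    }
}

With that, mapToObj would emit new Pair(s.substring(0, splitIndex), s.substring(splitIndex)), and the final map step could be folded into toString.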

Algorithm Choice

This algorithm is tuned to words that are way shorter than the number of items in the list. If the list were very short and the words were very long, then switching back to a composition task instead of a decomposition task would work better. Given that the list is 50,000 strings in size, and German words, while long, are very unlikely to exceed 50 characters, that is a 1:1000 factor in favor of this algorithm.

If, on the other hand, you had 50 strings that were on average 50,000 characters long, a different algorithm, such as the composition direction sketched below, would be far more efficient.
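
In its simplest form, that composition direction would look roughly like this (a sketch only; it performs n^2 set lookups, hopeless for 50k short strings but cheap for 50 long ones):

// Composition instead of decomposition: try all pairs (s, t) and
// test whether their concatenation is also in the set.
for (String s : strings) {
    for (String t : strings) {
        if (strings.contains(s + t)) {
            System.out.println(s + t + " = " + s + " + " + t);
        }
    }
}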

Algorithm 2: Sort and keep a candidate list

One algorithm I thought about for a little while was to sort the list, with the knowledge that if a string represents the start of a pair, all candidate strings that could be one of its pairs will come immediately after it in order, among the set of items that start with that string. Sorting my tricky data above and adding some confounders (downer, downs, downregulate), we get:

a
abc
abcdef
bear
bearhug
cur
curl
curlique
def
down ---------\
downer        |
downregulate  | not far away now!
downs         |
downstream ---/
hug
shine
stream
sun
sunshine

Thus if a running set of all items to check were kept, we could find candidate composites in essentially constant time per word, then probe directly into a hash table for the remainder word:

import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.stream.Stream;

import com.google.common.collect.ImmutableList;
import com.google.common.collect.ImmutableSet;

int minWordLength = 4;

Set<String> strings = Stream
   .of(
      "a", "abc", "abcdef", "def", "sun", "sunshine", "shine",
      "bear", "hug", "bearhug", "cur", "curlique", "curl",
      "down", "downs", "downer", "downregulate", "downstream", "stream")
   .filter(s -> s.length() >= minWordLength)
   .collect(ImmutableSet.toImmutableSet());

ImmutableList<String> orderedList = strings
   .stream()
   .sorted()
   .collect(ImmutableList.toImmutableList());
// invariant: candidates holds every string seen so far that could still be
// a prefix of the current string (in sorted order, any prefix of the
// current string must be among them)
List<String> candidates = new ArrayList<>();
List<Map.Entry<String, String>> pairs = new ArrayList<>();

for (String currentString : orderedList) {
   List<String> nextCandidates = new ArrayList<>();
   nextCandidates.add(currentString);
   for (String candidate : candidates) {
      if (currentString.startsWith(candidate)) {
         nextCandidates.add(candidate);
         // the candidate is a prefix; check whether the rest is a word too
         String remainder = currentString.substring(candidate.length());
         if (remainder.length() >= minWordLength && strings.contains(remainder)) {
            pairs.add(new AbstractMap.SimpleEntry<>(candidate, remainder));
         }
      }
   }
   candidates = nextCandidates;
}
pairs.forEach(System.out::println);

Result:

down=stream

The algorithmic complexity of this one is a little more complicated. The searching part is, I think, O(n) on average and O(n^2) in the worst case. The most expensive part might be the sorting, which depends on the algorithm used and the characteristics of the unsorted data. So take this one with a grain of salt, but it has possibility. It seems to me that this is going to be way less expensive than building a Trie out of an enormous data set, because you only probe it once comprehensively and don't get any amortization of the build cost.

Also, this time I chose a Map.Entry to hold the pair. How you do it is completely arbitrary; making a custom Pair class or using some existing Java class would be fine.

ErikE
  • This definitely sounds interesting. It wouldn't work too well if you had a bunch of really long words, since you'd end up checking all the different ways to split them into pairs, but what are the odds of that? – Mad Physicist Sep 10 '18 at 06:16
  • True. This algorithm is tuned to words that are way shorter than the number of items in the list. If the list were very short and the words were very long, then switching back to a composition task instead of a decomposition task would work better. – ErikE Sep 10 '18 at 06:19
    Your first algorithm is perfectly clear and does exactly what I need. Actually, it's something I should've found myself. The second one is strange, but very interesting. I wouldn't be scared about sorting; `n * log(n)` is something like `n * 20` and you save a factor of about 20 by not trying all split indexes. – maaartinus Sep 10 '18 at 07:22
  • @maaartinus What are the counts of strings 4+ and 8+ in length, and what is the average length of the 8+ character strings? How many composites have more than one pair of components? (Like if bigredhouse were bigred + house and big + redhouse.) Can you try these both on your data set and tell me what the performance characteristics are? – ErikE Sep 10 '18 at 07:37
  • Total strings: 41906, 4+ strings: 39894, 8+ strings: 21854, their average length: 11.043, total composed strings: 6964, non-uniquely composed strings: 360. The first algorithm takes 110 ms, I haven't tried the other yet. Sorting the 39894 strings takes 43 ms, more than I expected. – maaartinus Sep 10 '18 at 08:07
  • @maaartinus Wow, thanks! Most people I help never come back after being asked questions like that. I really appreciate it. Now I’m burning with curiosity about the second algorithm. – ErikE Sep 10 '18 at 08:16
  • It's faster, 64 ms (i.e., 21 ms without sorting). You know, [benchmarking Java is hard](https://www.slideshare.net/howarddgreen/the-art-of-java-benchmarking), and what I did may be completely misleading, however, writing a proper benchmark would take me too much time ATM. I appreciate all useful answers and always come back, when there's something to say. – maaartinus Sep 10 '18 at 08:27
  • Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/179735/discussion-between-maaartinus-and-erike). – maaartinus Sep 10 '18 at 09:06
1

You can improve Erik's answer by avoiding most of the substring creation, using CharBuffer views and altering their position and limit:

import java.nio.CharBuffer;
import java.util.Set;
import java.util.stream.Collectors;
import java.util.stream.IntStream;
import java.util.stream.Stream;

Set<CharBuffer> strings = Stream
    .of(
        "a", "abc", "abcdef", "def", "sun", "sunshine", "shine",
        "bear", "hug", "bearhug", "cur", "curlique", "curl",
        "down", "downstream", "stream"
    )
    .filter(s -> s.length() >= 4) // < 4 is irrelevant
    .map(CharBuffer::wrap)
    .collect(Collectors.toSet());

strings
    .stream()
    .filter(s -> s.length() >= 8)
    // wrap again, so changing position/limit below doesn't touch the set element
    .map(CharBuffer::wrap)
    .flatMap(cb -> IntStream.rangeClosed(4, cb.length() - 4)
        // position(i) narrows the view to the suffix, flip() to the prefix
        .filter(i -> strings.contains(cb.clear().position(i))
                  && strings.contains(cb.flip()))
        .mapToObj(i -> cb.clear() + " = " + cb.limit(i) + " + " + cb.clear().position(i))
    )
    .forEach(System.out::println);

This is the same algorithm, so it doesn't change the time complexity, unless you incorporate the hidden character-data copying costs, which would add another factor (times the average string length).

Of course, the differences become significant only if you use a different terminal operation than printing the matches, as printing is quite an expensive operation. Likewise, when the source is a stream over a large file, the I/O will dominate the operation, unless you go in an entirely different direction, like using memory mapping and refactoring this operation to work on ByteBuffers.
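
That direction could start out something like this (a hypothetical sketch; it assumes one ASCII word per line, a made-up file name, and Java 9+ for the covariant ByteBuffer position/limit overrides):

import java.nio.ByteBuffer;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

// Hypothetical: map the word list and collect zero-copy ByteBuffer slices
// that can play the role of the CharBuffer views above.
try (FileChannel ch = FileChannel.open(Paths.get("words.txt"))) {
    MappedByteBuffer mapped = ch.map(FileChannel.MapMode.READ_ONLY, 0, ch.size());
    List<ByteBuffer> words = new ArrayList<>();
    int start = 0;
    for (int i = 0; i < mapped.limit(); i++) {
        if (mapped.get(i) == '\n') {
            // a view of [start, i); no characters are copied
            words.add(mapped.duplicate().position(start).limit(i));
            start = i + 1;
        }
    }
    // words can now be put into a Set and probed like the CharBuffers above
}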

Holger
  • I was trying to do something like this with a `CharBuffer`, but this is far far nicer! – Eugene Sep 10 '18 at 11:49
  • While I do appreciate this answer and learned something, part of me feels a little... disquieted by the quantity of code lifted intact from my answer. I'm not sure if this feeling has any merit, but at least I had the opportunity to express it. :) – ErikE Sep 10 '18 at 17:56
    @ErikE The answer names and links the source, so everyone can see what has changed and what hasn't. Well, besides, the change has been explained in the answer anyway. Isn't that what posting code here is all about? To hope that someone uses it to build something new on top? – Holger Sep 11 '18 at 06:22
  • @Holger I don’t know. It doesn’t quite seem like good form to yank code. Maybe I am wrong. Maybe it should actually be flattering. – ErikE Sep 11 '18 at 06:24
  • @ErikE It should be flattering. After all, you’re publishing code under the CC here, for the very purpose of allowing its reuse. Some of the thousands of readers gave you an upvote, besides that you rarely learn about who is using your code somewhere. So you should be happy about this additional feedback. I could have written a similar piece of code which looks different, but why? I used this answer to acknowledge that your solution is very useful and my additional idea doesn’t change that usefulness. – Holger Sep 11 '18 at 06:32
  • @Holger It’s not a condemnation. Just a feeling. I’m not new to SO and haven’t encountered it before. – ErikE Sep 11 '18 at 07:08
0

A possible solution could be this: take one string as your candidate prefix and a second as your candidate suffix, then go through each string; if it begins with the prefix, check whether it ends with the suffix, and keep going until the end. To save some time, before comparing the letters themselves you could make a length check (the composite must be exactly as long as the two parts combined). It's pretty much what you did, but with this added length check you might be able to trim off a few comparisons. That's my take on it, at least.
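
A sketch of that idea (brute force over all pairs, with the length check running before any character comparison; it stays far more expensive than the approaches above):

// For each candidate pair (prefix, suffix), scan the set; the cheap
// length check runs before any startsWith/endsWith work.
for (String prefix : strings) {
    for (String suffix : strings) {
        for (String composite : strings) {
            if (composite.length() != prefix.length() + suffix.length()) {
                continue; // lengths don't add up, skip the expensive checks
            }
            if (composite.startsWith(prefix) && composite.endsWith(suffix)) {
                System.out.println(composite + " = " + prefix + " + " + suffix);
            }
        }
    }
}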

0

Not sure if this is better than your solution, but I think it's worth a try.

Build two Tries, one with the candidates in normal order, the other with the words reversed.

Walk the forwards Trie from depth 4 inwards; wherever a word ends, use the remainder of the current string as the suffix candidate (or something like that) and look it up in the backwards Trie, roughly as sketched below.

I've posted a Trie implementation in the past here: https://stackoverflow.com/a/9320920/823393.
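
A rough sketch of the idea (using a minimal trie written for illustration, not the linked implementation):

import java.util.Collection;
import java.util.HashMap;
import java.util.Map;

// Minimal trie node, for illustration only.
class TrieNode {
    final Map<Character, TrieNode> children = new HashMap<>();
    boolean isWord;

    static void insert(TrieNode root, String word) {
        TrieNode node = root;
        for (char c : word.toCharArray()) {
            node = node.children.computeIfAbsent(c, k -> new TrieNode());
        }
        node.isWord = true;
    }
}

class TwoTries {
    final TrieNode forwards = new TrieNode();
    final TrieNode backwards = new TrieNode();

    TwoTries(Collection<String> words) {
        for (String w : words) {
            TrieNode.insert(forwards, w);
            TrieNode.insert(backwards, new StringBuilder(w).reverse().toString());
        }
    }

    // Walk the forwards trie along each candidate; every word-node at depth
    // >= 4 marks a prefix s, and the rest of the candidate is the suffix t,
    // which is looked up (reversed) in the backwards trie.
    void findPairs(Collection<String> words) {
        for (String w : words) {
            TrieNode node = forwards;
            for (int i = 0; i < w.length() && node != null; i++) {
                node = node.children.get(w.charAt(i));
                int prefixLen = i + 1;
                if (node != null && node.isWord
                        && prefixLen >= 4 && w.length() - prefixLen >= 4
                        && containsReversed(w.substring(prefixLen))) {
                    System.out.println(w + " = " + w.substring(0, prefixLen)
                            + " + " + w.substring(prefixLen));
                }
            }
        }
    }

    private boolean containsReversed(String suffix) {
        TrieNode node = backwards;
        for (int i = suffix.length() - 1; i >= 0 && node != null; i--) {
            node = node.children.get(suffix.charAt(i));
        }
        return node != null && node.isWord;
    }
}

Given the word set, new TwoTries(strings).findPairs(strings) would then print the composites.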

OldCurmudgeon