11

The Online Encyclopedia of Integer Sequences supports search for sequences containing your query as a subsequence, eg. searching for subseq:212,364,420,428 will return the 8*n+4 sequence. (http://oeis.org/search?q=subseq:212,364,420,428)

This amazing feature was apparently implemented by Russ Cox as by http://oeis.org/wiki/User:Russ_Cox/OEIS_Server_Features, but it is not specified with what algorithm this is implemented.

I'm wondering how it is done. Clearly going through nearly a million of sequences for every search is impractical for a search engine. Just keeping an index (which is how the same Russ Cox did Google Code Regex Search) of the first number and brute forcing the rest also doesn't work, as numbers like 0 is in nearly all sequences. In fact some queries like 0 1 match a high percentage of the total database, so the algorithm needs a running time sensitive to the desired output size.

Does anyone happen to know how this feature is implemented?

Guy Coder
  • 24,501
  • 8
  • 71
  • 136
Thomas Ahle
  • 30,774
  • 21
  • 92
  • 114
  • 1
    I'm guessing suffix arrays, or suffix trees. The numbers in a sequence behave like characters in a string, and these data structures allow a length-k substring to be looked up in O(k) (or O(k log k) for suffix arrays) time, regardless of the "database" size, n. – j_random_hacker Aug 16 '14 at 13:53
  • 1
    @j_random_hacker notice that I'm talking about subsequences and not substrings. I don't know any generalization of suffix data structures that allow fast subsequence queries. – Thomas Ahle Aug 16 '14 at 14:05
  • IMO, this algo is very similar to google's search for pages that contain entered words. Only SMALL numbers are indexed by OEIS search engine. Try `http://oeis.org/search?q=1212+1364+1420+1428` – Egor Skriptunoff Aug 16 '14 at 14:08
  • Whoops, missed that sorry! I'm also not aware of any data structure that allows fast subsequence queries. My new guess would be the way that many search engines work: keep an "inverted index" that, for each number (up to some bound) records the list of sequences that this number appears in. Then intersect these lists pairwise (linear time per pair), starting with the two smallest lists. – j_random_hacker Aug 16 '14 at 14:09
  • @EgorSkriptunoff Sequences are usually around 100 terms. I think all terms of each sequence is indexed. – Thomas Ahle Aug 16 '14 at 14:11
  • Numbers like 0 and 1 that appear in nearly all the lists can be handled very nicely using this technique: you simply decide on a maximum sequence count bound, and avoid storing the list of containing sequences for any number (like 0 or 1) that appears in more than this many sequences. Saves lots of time and lots of space! – j_random_hacker Aug 16 '14 at 14:12
  • @j_random_hacker I don't think this would work. Inverse indices for each word of `3 2 1 0` might match nearly all sequences, but the final result would match nearly nothing. - because the order matters. – Thomas Ahle Aug 16 '14 at 14:12
  • Good point, but I think this will mostly only be a problem for small numbers (e.g. under 100), in which case it will suffice to index ordered pairs or triples too. Either all O(n^2) ordered pairs of small numbers could be generated from the query and then looked up, or only O(n) (e.g. by taking adjacent pairs, e.g. {(3,2), (1,0)} in your example). – j_random_hacker Aug 16 '14 at 14:20
  • 2
    Doesn't searching space-separated numbers on the OEIS ignore order? There are only ~250k sequences, so inverted indices should be more than sufficient. – Nabb Aug 16 '14 at 14:22
  • Basically all you need is to be able to generate a set of keys from any given query having the properties that (1) the intersection of the sets found by looking up these keys contains all the answers, and not too many non-answers (a filtering step can quickly eliminate the latter); and (2) at least one of the sets found is much smaller than the size of the database. There are probably many ways of approximately doing this, but if inverted indexes aren't sufficient on their own, I think adding keys for ordered pairs of small numbers will be. – j_random_hacker Aug 16 '14 at 14:23
  • And it seems that @Nabb is right -- I just tried the query `428 212 364 220` and got back the same 8 answers as the OP's original query. So I'm guessing inverted indices. – j_random_hacker Aug 16 '14 at 14:26
  • 1
    Sorry, it appears you have to write `subseq:x,y,z`. I just thought the 'space' version was a shorthand. I've updated my question. – Thomas Ahle Aug 16 '14 at 14:36
  • @j_random_hacker I agree that pair indices might speed things up, but it clearly would still be possible to create a database where this would be worse than brute force. And probably most databases would have such queries, yielding DDOS attacks. I don't even find it easy to analyze how this would perform in 'average'. It is entirely possible that this is what OEIS does, but I'm hoping they have something with better guarantees. – Thomas Ahle Aug 16 '14 at 14:43
  • 2
    From a few quick tests, I think OEIS *is* probably vulnerable to DDOS attacks. The query `3 2 1 0` took 2.1s and returned a statement saying "Found 149461 results, too many to show." A subsequent identical query gave the same result in 0s. A subsequent semantically equivalent query of `2 0 1 3` took 2.1s and again said "Found 149461 results, too many to show." So it seems they sensibly cache the result of recent queries, and in fact 2s is plenty of time for a brute-force search. I think there have been no DDOSes because no-one has anything against the OEIS! – j_random_hacker Aug 16 '14 at 15:06
  • Regarding adversarial analysis, here is how I would approach it: First of all, it's sensible, socially acceptable and helps analysis to place an upper bound U on the size of any query result set. Wearing our "Adversary" hat, and assuming a particular implementation (e.g. pairwise intersections using an inverted index), we can then look for a query that generates an intermediate result set of size much greater than U. (The engine may use "fail-fast" mechanisms to detect when this will be the case; ... – j_random_hacker Aug 16 '14 at 15:17
  • ... instead of assuming that there are none, or that we know what they are, we can fall back on looking for a query whose final result set has size <= U, since the engine *cannot* fail fast on such a query, assuming that it guarantees to return correct answers for every query whose answer has size <= U.) – j_random_hacker Aug 16 '14 at 15:17
  • I just tried `subseq:0,1,2,3` and it took around 7 seconds. (Now it is faster, due to caching, but similar searches are still slow). So something would indicate that they are indeed just using heuristic methods. – Thomas Ahle Aug 16 '14 at 15:23
  • So, how does the pairwise-intersection approach fare? If we can make one more assumption, it fares very well. The assumption I think we want to make (and therefore that we want to design the implementation to honour) is that, for any query with result set size <= U, there is at least one key generated whose result set has size O(U), i.e. bounded by a constant times U. ... – j_random_hacker Aug 16 '14 at 15:24
  • ... If the DB size is n and the number of query terms is bounded by m, and looking up the size of a key's result set is O(1), then query time is bounded by O(Um log U) (i.e. independent of n): just look at each of the m terms in the query, find the term t having smallest result set size, and loop through every element x in t's O(U)-sized result set, searching for it in the m-1 other result sets with binary search in O(log U) time each. – j_random_hacker Aug 16 '14 at 15:26
  • Seems like you could treat it like a custom sort and thus use a balanced binary tree (basically a segment tree). Treat `to the left of` as `less than` and `to the right of` as `greater than`. In practice, this means recursively split the sequence into 3 parts: the median and the list elements to the left and right of the median. Then do the same with the left and right lists. If each parent node (median or say element at index `size/2`) in the tree contains a multiset containing itself and the values of it's children, then it should allows for about `O(k log N)` queries. – Nuclearman Aug 17 '14 at 22:13
  • Though it can go badly if the sequence is long but contains few unique elements. A query for say `(10,20,35,50,30,72,25)` would find every `50` and remove the ones that don't have `(10,20,35)` to the left and `(30,72,25)` to the right. Then would find a `20` to the left and a `72` to the right of each `50` removing those that didn't have it. – Nuclearman Aug 17 '14 at 22:20

2 Answers2

3

My guess is part of the data is stored in an inverted index. That is each number is linked to a set of series, and when multiple sequences are entered, the set of common sequences is shown. This is extremely fast and used by almost every search engine.

Storing as a suffix trees or any linked data structure is useless for this application.

At least for some set of sequences (eg ax+b), I think it is better to save them parametrically rather than storing the actual sequence.

ElKamina
  • 7,747
  • 28
  • 43
0

First of all, that online search only seems to work with numbers up to a 1000. Does it work for larger numbers too? Secondly, just out of curiosity, for the example that you have provided, for some reason, OEIS does not list A000027, which is just natural numbers, but obviously it should match.

Database Based Solution

If this was implemented purely in DB, for a 4 item search, it would be something like this.

Tables

sequence {seqid, seqname, etc..}

seqitem {value, seqid, location }

Query

select si1.ds, si1.location, si2.location .... from seqitem si1, seqitem si2, seqitem si3, seqitem si4 where si1.seqid = si2.seqid and si2.seqid = si3.seqid and si3.seqid = si4.seqid and si1.location < si2.location and si2.location < si3.location and si3.location < si4.location and si1.value =$v1 and si2.value = $v2 and si3.value = $v3 and si4.value = $v4

Amrinder Arora
  • 686
  • 7
  • 17