
I have a million integers in sorted order and I would like to find the longest subsequence where the differences between consecutive elements are equal. For example

1, 4, 5, 7, 8, 12

has a subsequence

   4,       8, 12

My naive method is greedy and just checks how far you can extend a subsequence from each point. It seems to take O(n²) time per point.
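
Roughly, it looks like this (a sketch, not my exact code):

a = [1, 4, 5, 7, 8, 12]    # example from above
aset = set(a)
best = 2
# for each pair, walk forward while the next term of the progression
# is present; worst case O(n^3) overall
for i in xrange(len(a)):
    for j in xrange(i + 1, len(a)):
        d = a[j] - a[i]
        if d == 0:
            continue
        x, length = a[j], 2
        while x + d in aset:
            x += d
            length += 1
        best = max(best, length)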

Is there a faster way to solve this problem?

Update. I will test the code given in the answers as soon as possible (thank you). However it is already clear that using n^2 memory will not work. So far there is no code that terminates with the input [random.randint(0,100000) for r in xrange(200000)] .

Timings. I tested with the following input data on my 32-bit system.

a= [random.randint(0,10000) for r in xrange(20000)] 
a.sort()
  • The dynamic programming method of ZelluX uses 1.6G of RAM and takes 2 minutes and 14 seconds. With pypy it takes only 9 seconds! However it crashes with a memory error on large inputs.
  • The O(nd) time method of Armin took 9 seconds with pypy but used only 20MB of RAM. Of course this would be much worse if the range were much larger. The low memory usage meant I could also test it with a= [random.randint(0,100000) for r in xrange(200000)] but it didn't finish in the few minutes I gave it with pypy.

In order to test Kluev's method I reran with

a= [random.randint(0,40000) for r in xrange(28000)] 
a = list(set(a))
a.sort()

to make a list of length roughly 20000. All timings are with pypy:

  • ZelluX, 9 seconds
  • Kluev, 20 seconds
  • Armin, 52 seconds

It seems that if the ZelluX method could be made linear space it would be the clear winner.

Simd
  • How quickly do YOU think this can be done? – Mitch Wheat Aug 10 '13 at 08:00
  • What's to stop you from looping over all the elements, finding a(n - 1) - a(n-2), then seeing if that value equals a(n) - a(n - 1) for all n in your list then seeing how long it takes? – Kon Aug 10 '13 at 08:01
  • 1
    The subsequence can miss points out as in the example I gave. – Simd Aug 10 '13 at 08:03
  • 44
    I don't understand all the downvotes, and especially not the close votes (off-topic, better suited for [su]? *Seriously?*). It's an interesting problem that I'd like to see answers to, too. So, +1 from me. – Tim Pietzcker Aug 10 '13 at 08:18
  • 7
    @user2179021: You might improve your question by including the code you already have. That seems to calm some of the more critical SO users. Don't worry about the downvotes for now. – Tim Pietzcker Aug 10 '13 at 08:19
  • 3
    @TimPietzcker I'm with you here, I've seen much worser questions, I think it good one here – Roman Pekar Aug 10 '13 at 08:31
  • I think a good answer to this will have either a faster algorithm or a clear analysis of why that's not possible – Roman Pekar Aug 10 '13 at 08:33
  • 1
    Some quick googling revealed this paper: http://theory.cs.uiuc.edu/~jeffe/pubs/pdf/arith.pdf – georg Aug 10 '13 at 08:36
  • @thg435: I'm no computer scientist, but it seems that paper only covers progressions through adjacent indices. – Tim Pietzcker Aug 10 '13 at 08:40
  • 2
    Could you substantify what you consider a valid "subsequence"? Any subset of points from the original set while remaining order, but not necessarily contiguous? Or something different? – orlp Aug 10 '13 at 09:01
  • @nightcracker Any subset of points from the original set, preserving order, but not necessarily contiguous is exactly right. – Simd Aug 10 '13 at 09:05
  • 1
    Don't know if this helps but another way to look at the problem is to take differences between adjacent items (here: `3, 1, 2, 1, 4`) and trying to group them so that adjacent groups have the same sum: `3, (1, 2, 1), (4)` – Michael Butscher Aug 10 '13 at 09:07
  • 7
    In the example, what decides that `4, 8, 12` is the correct output over `1, 4, 7` which is an equally long sequence? – FThompson Aug 10 '13 at 09:09
  • 1
    @Vulcan They are both optimal. – Simd Aug 10 '13 at 09:42
  • while I was writing the code, you've changed the task and input :( I thought you wanted to find sequences where the distance between elements is equal and the difference is also equal – Roman Pekar Aug 10 '13 at 09:49
  • 1
    Can two adjacent integers be identical? Or is there always at least a difference of `1`? – Tim Pietzcker Aug 10 '13 at 09:59
  • @TimPietzcker I can make sure that there is always a difference of at least 1 if that helps. – Simd Aug 10 '13 at 10:21
  • @RomanPekar Sorry for any confusion. It's just the difference that has to be equal. – Simd Aug 10 '13 at 10:21
  • With Armin's solution, you need to optimize the range tested. If you have a list like this `1,2,3,4,5,10000000` you only need to check 1-5. The range tested should be equal to the highest difference in the list that is less than `range/2` – גלעד ברקן Aug 12 '13 at 10:05
  • could you please test my updated code for correctness and speed? On my machine it runs in about 15 sec on 20000 random elements, sometimes less than 5, and it worked correctly on small test cases, but maybe I missed something.. – Roman Pekar Aug 12 '13 at 21:04
  • Don't have time to develop my idea further at the moment, but consider the dual problem: From the sequence (of differences) {3,1,2,1,4} find the longest set of equal sequential sums. So here there's {{3},{1,2}} and {{1,2,1},{4}}. – Mark Hurd Aug 14 '13 at 05:01
  • In your speed comparison at the end of your question -- did you change ZelluX and Kluev's code to output the actual sequence? If you did not, this would not be a fair comparison with Armin's code (does the length of the sequence alone answer the question?) By the way, I updated my code so it works with the examples in the comments to Kluev's answer. – גלעד ברקן Aug 14 '13 at 18:08
  • @user2179021 have you tested my code on your benchmark? – Roman Pekar Aug 15 '13 at 12:12
  • 1
    Not every set has arithmetic subsequences. For example, the powers of 2. However, if a set has positive density, then it has long arithmetic subsequences. This is Szemerédi's theorem (1975). For any density d and length k, there exists N(d,k) such that every subset of 1..n (where n>N) of density d contains a length k arithmetic progression. The primes have density zero (PNT), but Ben Green and Terence Tao [recently proved](http://arxiv.org/abs/math.NT/0404188) (non constructively) there are arbitrarily long arithemetic progressions in there. The longest known (26) is https://oeis.org/A204189 – Colonel Panic Aug 15 '13 at 14:30
  • @ColonelPanic: "Not every set has arithmetic subsequences" --- if it has at least two elements, it has a trivial two-element arithmetic subsequence. *Beyond* that, I agree with your family of counterexamples. – Jonas Kölker Aug 18 '13 at 12:21

10 Answers


We can have a solution O(n*m) in time with very modest memory needs, by adapting yours. Here n is the number of items in the given input sequence of numbers, and m is the range, i.e. the highest number minus the lowest one.

Call A the sequence of all input numbers (and use a precomputed set() to answer in constant time the question "is this number in A?"). Call d the step of the subsequence we're looking for (the difference between two consecutive numbers of this subsequence). For every possible value of d, do the following linear scan over all input numbers: for every number a from A in increasing order, if it was not already seen, look forward in A for the length of the sequence starting at a with step d. Then mark all items in that sequence as already seen, so that we avoid searching again from them, for the same d. Because of this, the complexity is just O(n) for every value of d.

A = [1, 4, 5, 7, 8, 12]    # in sorted order
Aset = set(A)

best = (0, None, None)     # (count, start, end)
for d in range(1, A[-1] - A[0] + 1):   # every possible step
    already_seen = set()
    for a in A:
        if a not in already_seen:
            b = a
            count = 1
            while b + d in Aset:
                b += d
                count += 1
                already_seen.add(b)
            print "found %d items in %d .. %d" % (count, a, b)
            if count > best[0]:        # collect here the largest 'count'
                best = (count, a, b)

Updates:

  • This solution might be good enough if you're only interested in values of d that are relatively small; for example, if getting the best result for d <= 1000 would be good enough. Then the complexity goes down to O(n*1000). This makes the algorithm approximative, but actually runnable for n=1000000. (Measured at 400-500 seconds with CPython, 80-90 seconds with PyPy, with a random subset of numbers between 0 and 10'000'000.)

  • If you still want to search the whole range, and if the common case is that long sequences exist, a notable improvement is to stop as soon as d is too large for an even longer sequence to be found (see the sketch below).
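
Concretely, the second improvement might look like this (a sketch, not part of the original answer; the function name is just illustrative):

def longest_with_early_stop(A):
    Aset = set(A)
    best = 2                       # any two points form a valid subsequence
    for d in range(1, A[-1] - A[0] + 1):
        # a sequence of best+1 items with step d spans best*d values;
        # once that exceeds the whole range, no improvement is possible
        if best * d > A[-1] - A[0]:
            break
        already_seen = set()
        for a in A:
            if a not in already_seen:
                b, count = a, 1
                while b + d in Aset:
                    b += d
                    count += 1
                    already_seen.add(b)
                best = max(best, count)
    return best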

Armin Rigo
  • This solution might be good enough if you're only interested in values of d that are relatively small; for example, if getting the best result for d <= 1000 would be good enough. Then the complexity goes down to `O(n*1000)`, which might run in less than a minute for n=1000000 (try also PyPy). – Armin Rigo Aug 10 '13 at 10:45
  • Nice idea if range is very limited – RiaD Aug 10 '13 at 10:58
  • But if `d <= 1000` then you may just remove duplicates; you will have at most 1000 elements and can solve it in O(1000^2), which will take a few seconds at most I believe. – RiaD Aug 10 '13 at 11:05
  • No, if A has one million numbers that are between 0 and 10'000'000, then `m = 10'000'000`. But if we restrict ourselves to `d <= 1000` we are looking for sequences within the whole A that have at most 1000 as a step. – Armin Rigo Aug 10 '13 at 11:11
  • Ah, I get what you mean – RiaD Aug 10 '13 at 11:13
  • Whoops, missed the fact that the input is in sorted order! +2 :) – j_random_hacker Aug 10 '13 at 17:24
  • Nice reusing of the [Longest Consecutive Sequence](http://discuss.leetcode.com/questions/1070/longest-consecutive-sequence) idea. – lcn Aug 10 '13 at 20:16
  • Thank you. I think the best property of this solution is the bound as a function of d. O(nm) looks bad but O(nd) looks much better. How do you actually output an optimal solution with your code? – Simd Aug 10 '13 at 20:20
  • You remember the largest `count` that the inner loop has come to, and the corresponding values of `a` and `b` are the bounds. – Armin Rigo Aug 11 '13 at 07:07
  • @ArminRigo your dynamic algorithm is good, but it doesn't work for situations where the number of elements in the list is not so large but the maximum is big. I've tried 300 elements with maximum element = 1000000 - it's too slow :( – Roman Pekar Aug 11 '13 at 07:18
  • @RomanPekar for these cases where n is small, look at the other algorithms presented on this page, which seem to be of a complexity depending only on n. – Armin Rigo Aug 11 '13 at 15:03
  • @RomanPekar i think m should be the median, try it with 300 elements, highest 1000000 and `for d in range(1, [median of array])`. It makes no sense to test a difference greater than the median, since you can't find a sequence longer than two elements that way. – גלעד ברקן Aug 12 '13 at 00:52
  • @RomanPekar actually, median might be the wrong term, I think maybe I just meant `(highest-lowest)/2` – גלעד ברקן Aug 12 '13 at 01:04
  • @groovy actually it doesn't mean much; the list is sorted, so if the list is sequential like 1, 2, 3, 4, 5 then `lowest` = start index, `highest` = end index and (`highest` - `lowest`) / 2 is ~ N / 2, so the algorithm is O(N^2). If it's not sequential then it's even larger. I don't see why this algorithm is upvoted so much - IMHO it will not work even for 10000 elements with arbitrarily spaced sequences – Roman Pekar Aug 12 '13 at 06:08
  • @RomanPekar that would depend on the distribution, I think. If you have a list like this 1,2,3,4,5,10000000 you only need to check 1-5. The range tested should be equal to the highest difference in the list that is less than `range/2`. – גלעד ברקן Aug 12 '13 at 10:07
  • Have you tried finding the longest sequence in a 5000-element random list with this algorithm? I have no time now, maybe later I'll try – Roman Pekar Aug 12 '13 at 10:11
  • @ArminRigo Actually, a preliminary optimization should be -- `test the outer differences (S[1] - S[0] and S[N-1] - S[N-2]); if any are greater than range/2, remove the corresponding outer element and readjust the range; repeat until false`. After that, still use a range equal to the highest difference in the list that is less than `range/2` in the algorithm. – גלעד ברקן Aug 12 '13 at 10:49
  • Is the given complexity correct here? You seem to forget the full while loop. – KillianDS Aug 12 '13 at 15:23
  • I don't understand why you suggest the possibility of searching for d greater than the largest difference smaller than half the range since a sequence of more than two elements with that difference could not exist, and there is no shortage of two-element sequences. – גלעד ברקן Aug 12 '13 at 20:10
  • @KillianDS: the innermost while loop doesn't change the complexity. If you fix a value of d, it will loop at most once per item, which is at most n times in total. – Armin Rigo Aug 13 '13 at 05:25
  • @groovy: indeed, if you like you can most of the time stop at half the range. As I said in my second *Update* above, you can more generally stop at `range / a` when the best answer found so far is `a`. – Armin Rigo Aug 13 '13 at 05:29
  • @user2179021 and Armin, That's nice. For `d <= 1000`, my code outputs for 1,000,000 random numbers between 1 and 10,000,000 in 60-70 seconds on my old IBM Thinkpad using PyPy. – גלעד ברקן Aug 14 '13 at 23:28
  • Note: If you are interested only on the longest seq, you could limit `d` based on the longest sequence you've found so far. – Karoly Horvath Aug 15 '13 at 12:03

UPDATE: I've found a paper on this problem, you can download it here.

Here is a solution based on dynamic programming. It requires O(n^2) time complexity and O(n^2) space complexity, and does not use hashing.

We assume all numbers are saved in array a in ascending order, and n saves its length. The 2D array l[i][j] holds the length of the longest equally-spaced subsequence ending with a[i] and a[j], and l[j][k] = l[i][j] + 1 if a[j] - a[i] = a[k] - a[j] (i < j < k).

lmax = 2
l = [[2 for i in xrange(n)] for j in xrange(n)]
for mid in xrange(n - 1):
    prev = mid - 1
    succ = mid + 1
    # two-pointer scan for pairs with a[prev] + a[succ] == 2 * a[mid],
    # i.e. a[prev], a[mid], a[succ] equally spaced
    while (prev >= 0 and succ < n):
        if a[prev] + a[succ] < a[mid] * 2:
            succ += 1
        elif a[prev] + a[succ] > a[mid] * 2:
            prev -= 1
        else:
            l[mid][succ] = l[prev][mid] + 1
            lmax = max(lmax, l[mid][succ])
            prev -= 1
            succ += 1

print lmax
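
To also recover an optimal subsequence rather than just its length (an addition, not part of the original answer), record the winning pair, say as best_mid and best_succ, whenever lmax is updated, and then walk backwards through the values:

# assumes the loop above also saved best_mid, best_succ when lmax was updated
aset = set(a)
step = a[best_succ] - a[best_mid]
seq = [a[best_succ]]
while len(seq) < lmax and seq[-1] - step in aset:
    seq.append(seq[-1] - step)
seq.reverse()
print seq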
ZelluX
  • That's nice and clean. Can it be made to run in linear space? – Simd Aug 11 '13 at 20:17
  • @user2179021 The line lmax = max(lmax, l[mid][succ]) updates lmax, and if you want an optimal subsequence, you can save this sequence here too. – ZelluX Aug 12 '13 at 02:47
  • Do you think it can be made linear space? – Simd Aug 12 '13 at 19:28
  • @user2179021 I cannot figure it out. But I've found another solution which runs faster in some cases. Please see the link in updated answer. – ZelluX Aug 13 '13 at 16:46
  • Thank you. I am not convinced the method in the paper is faster in practice yet however. I think the space is the main issue really (as well as being able to use knowledge about the likely gap sizes to speed things up). – Simd Aug 13 '13 at 16:54

Update: The first algorithm described here is made obsolete by Armin Rigo's second answer, which is much simpler and more efficient. But both of these methods have one disadvantage: they need many hours to find the result for one million integers. So I tried two more variants (see the second half of this answer) where the range of input integers is assumed to be limited. Such a limitation allows much faster algorithms. I also tried to optimize Armin Rigo's code. See my benchmarking results at the end.


Here is an idea of an algorithm using O(N) memory. Time complexity is O(N^2 log N), but it may be decreased to O(N^2).

Algorithm uses the following data structures:

  1. prev: array of indexes pointing to previous element of (possibly incomplete) subsequence.
  2. hash: hashmap with key = difference between consecutive pairs in subsequence and value = two other hashmaps. For these other hashmaps: key = starting/ending index of the subsequence, value = pair of (subsequence length, ending/starting index of the subsequence).
  3. pq: priority queue for all possible "difference" values for subsequences stored in prev and hash.

Algorithm:

  1. Initialize prev with indexes i-1. Update hash and pq to register all (incomplete) subsequences found on this step and their "differences" (see the sketch after this list).
  2. Get (and remove) smallest "difference" from pq. Get corresponding record from hash and scan one of second-level hash maps. At this time all subsequences with given "difference" are complete. If second-level hash map contains subsequence length better than found so far, update the best result.
  3. In the array prev: for each element of any sequence found on step #2, decrement index and update hash and possibly pq. While updating hash, we could perform one of the following operations: add a new subsequence of length 1, or grow some existing subsequence by 1, or merge two existing subsequences.
  4. Remove hash map record found on step #2.
  5. Continue from step #2 while pq is not empty.
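
As a concrete illustration of step 1 (a sketch with hypothetical names, not the linked Ideone code), for the example from the question:

import heapq
from collections import defaultdict

A = [1, 4, 5, 7, 8, 12]
prev = [i - 1 for i in range(len(A))]        # prev = [-1, 0, 1, 2, 3, 4]
hash_by_diff = defaultdict(lambda: ({}, {})) # diff -> (by_start, by_end)
for i in range(1, len(A)):
    d = A[i] - A[i - 1]
    by_start, by_end = hash_by_diff[d]
    by_start[i - 1] = (1, i)                 # start index -> (length, end index)
    by_end[i] = (1, i - 1)                   # end index -> (length, start index)
pq = list(set(A[i] - A[i - 1] for i in range(1, len(A))))
heapq.heapify(pq)                            # pop smallest "difference" first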

This algorithm updates O(N) elements of prev O(N) times each, and each of these updates may require adding a new "difference" to pq. All this means a time complexity of O(N^2 log N) if we use a simple heap implementation for pq. To decrease it to O(N^2) we might use more advanced priority queue implementations. Some of the possibilities are listed on this page: Priority Queues.

See corresponding Python code on Ideone. This code does not allow duplicate elements in the list. It is possible to fix this, but it would be a good optimization anyway to remove duplicates (and to find the longest subsequence beyond duplicates separately).

And the same code after a little optimization: here the search is terminated as soon as the subsequence length multiplied by the possible subsequence "difference" exceeds the source list range.


Armin Rigo's code is simple and pretty efficient. But in some cases it does some extra computations that may be avoided. The search may be terminated as soon as the subsequence length multiplied by the possible subsequence "difference" exceeds the source list range:

def findLESS(A):
  Aset = set(A)
  lmax = 2
  d = 1          # d is the index distance between the two elements of a pair
  minStep = 0

  # stop once even the smallest step seen cannot fit a longer sequence
  while (lmax - 1) * minStep <= A[-1] - A[0]:
    minStep = A[-1] - A[0] + 1
    for j, b in enumerate(A):
      if j+d < len(A):
        a = A[j+d]
        step = a - b           # value difference for this pair
        minStep = min(minStep, step)
        # extend only if (b, a) starts a subsequence (b - step is absent)
        if a + step in Aset and b - step not in Aset:
          c = a + step
          count = 3
          while c + step in Aset:
            c += step
            count += 1
          if count > lmax:
            lmax = count
    d += 1

  return lmax

print(findLESS([1, 4, 5, 7, 8, 12]))

If the range of integers in the source data (M) is small, a simple algorithm is possible with O(M^2) time and O(M) space:

def findLESS(src):
  r = [False for i in range(src[-1]+1)]
  for x in src:
    r[x] = True

  d = 1
  best = 1

  while best * d < len(r):
    for s in range(d):
      l = 0

      for i in range(s, len(r), d):
        if r[i]:
          l += 1
          best = max(best, l)
        else:
          l = 0

    d += 1

  return best


print(findLESS([1, 4, 5, 7, 8, 12]))

It is similar to the first method by Armin Rigo, but it doesn't use any dynamic data structures. I suppose the source data has no duplicates. And (to keep the code simple) I also suppose that the minimum input value is non-negative and close to zero.


The previous algorithm may be improved if, instead of the array of booleans, we use a bitset data structure and bitwise operations to process data in parallel. The code shown below implements the bitset as a built-in Python integer. It has the same assumptions: no duplicates, and the minimum input value is non-negative and close to zero. Time complexity is O(M^2 * log L) where L is the length of an optimal subsequence, and space complexity is O(M):

def findLESS(src):
  r = 0
  for x in src:
    r |= 1 << x     # bit v of r is set iff value v is present

  d = 1
  best = 1

  while best * d < src[-1] + 1:
    # AND together shifted copies of r so that a bit v stays set
    # iff v, v+d, ..., v+best*d are all present; the two loops use
    # the binary representation of 'best' to do this in O(log best) steps
    c = best
    rr = r

    while c & (c-1):     # clear all bits of c except the highest
      cc = c & -c        # lowest set bit of c
      rr &= rr >> (cc * d)
      c &= c-1

    while c != 1:        # c is a power of two: repeated doubling
      c = c >> 1
      rr &= rr >> (c * d)

    rr &= rr >> d

    # as long as longer runs still exist, grow the best length
    while rr:
      rr &= rr >> d
      best += 1

    d += 1

  return best
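
To see the core shift-and step on the data from the question (an illustration, not part of the original answer):

r = 0
for x in [1, 4, 5, 7, 8, 12]:
    r |= 1 << x

d = 4
two = r & (r >> d)        # bit v set iff v and v+d present: bits 1, 4, 8
three = two & (two >> d)  # bit v set iff v, v+d and v+2d present: bit 4 only
print bin(three)          # 0b10000 -> the progression 4, 8, 12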

Benchmarks:

Input data (about 100000 integers) is generated this way:

random.seed(42)
s = sorted(list(set([random.randint(0,200000) for r in xrange(140000)])))

And for the fastest algorithms I also used the following data (about 1000000 integers):

s = sorted(list(set([random.randint(0,2000000) for r in xrange(1400000)])))

All results show time in seconds:

Size:                         100000   1000000
Second answer by Armin Rigo:     634         ?
By Armin Rigo, optimized:         64     >5000
O(M^2) algorithm:                 53      2940
O(M^2*L) algorithm:                7       711
Evgeny Kluev
  • 2
    I have to admit I don't understand this solution. Can you provide some code? – Simd Aug 11 '13 at 20:17
  • @user2179021: Not sure it will help. That should be quite a lot of code. Not easy to write. Too complicated to understand. I still hope it's possible to get the idea from this short description. Probably you have questions about some particular step? – Evgeny Kluev Aug 11 '13 at 20:32
  • Even step 1 would be helpful. What do we set prev, next, hash and pq to be in the first step using the example 1, 4, 5, 7, 8, 12 in the question? – Simd Aug 11 '13 at 21:00
  • @user2179021: `prev` is [Nil,0,1,2,3,4]. `next` is [1,2,3,4,5,Nil]. `pq` will contain all adjacent differences: 1,2,3,4. Every subsequence has length 1. This value together with starting/ending positions are stored in `hash`: {1:({1:1,3:1},{2:1,4:1}), 2:({2:1},{3:1}), 3:({0:1},{1:1}), 4:({4:1},{5:1})}. – Evgeny Kluev Aug 11 '13 at 21:07
  • @user2179021: On second thought some code could be helpful. Added link to Ideone. Also removed now obsolete list 'next' from the explanation. PS. I suspect this code is far from perfect. I use Python only occasionally. – Evgeny Kluev Aug 12 '13 at 11:51
  • Thank you. Your optimized code is very fast but worryingly sometimes gives a different answer to Armin's code. For example try [2, 6, 7, 10, 13, 14, 17, 18, 21, 22, 23, 25, 28, 32, 39, 40, 41, 44, 45, 46, 49, 50, 51, 52, 53, 63, 66, 67, 68, 69, 71, 72, 74, 75, 76, 79, 80, 82, 86, 95, 97, 101, 110, 111, 112, 114, 115, 120, 124, 125, 129, 131, 132, 136, 137, 138, 139, 140, 144, 145, 147, 151, 153, 157, 159, 161, 163, 165, 169, 172, 173, 175, 178, 179, 182, 185, 186, 188, 195] . Your code gives 7 but Armin says found 8 items in 6 .. 139 . – Simd Aug 12 '13 at 18:01
  • In fact [0, 6, 7, 10, 11, 12, 16, 18, 19] has a subsequence of length 4 (0,6,12,18) but your code outputs 3. – Simd Aug 12 '13 at 18:17
  • @user2179021: Right, it was a little bit overoptimized. Now fixed. – Evgeny Kluev Aug 12 '13 at 18:31
  • @user2179021: Just found a bug in the testing part of the code I submitted yesterday. In `sorted(list(set(a)))` I forgot `sorted`. It's hard to notice because after `list(set(a))` the list almost always looks sorted. Also cleaned up the code and provided one more solution for the cases where the value range is very limited. – Evgeny Kluev Aug 13 '13 at 16:43
  • This is very fast! Out of interest, how did you optimize Armin's code for "By Armin Rigo, optimized: " ? – Simd Aug 16 '13 at 11:04
  • @user2179021: I reordered calculations so that distance between elements increases with time; and terminated search when this distance is too large. All the details are already in the answer. – Evgeny Kluev Aug 16 '13 at 12:07
  • Beautiful and inspiring solution. – גלעד ברקן Aug 17 '13 at 13:58
  • Great solution :-) Random note, `sorted(list(set(a)))` is actually the same as `sorted(set(a))`. – Armin Rigo Aug 20 '13 at 10:06
  • @EvgenyKluev I noticed http://stackoverflow.com/questions/18241277/discover-long-patterns . Can you do something similarly clever for that? – Simd Aug 23 '13 at 21:32
  • @user2179021: That question already has a very impressive answer by jbaylina. Hard to invent anything better. – Evgeny Kluev Aug 24 '13 at 07:50
  • 1
    @EvgenyKluev Your approach here is better in at least three ways :) 1. Your solution is parallelised to make it really fast. 2. The time complexity is clear and does not depend on the size of the numbers. I am not 100 percent sure what the time complexity is of the proposed solution to the linked question. 3. If you have some idea about the likely range of 'd' your solutions is even faster. – Simd Aug 25 '13 at 18:35
  • @user2179021: (1) is completely true and that's the reason why I posted this solution here; (2) means also that instead of dependency on array size, this solution depends on values range - that's OK for "dense" data, but much worse when range is significantly greater than the array size (in linked question size=10^5 but range=10^7, which is not very good for this bitwise solution); (3) jbaylina's solution also may be optimized by making 'd' the outer loop variable, actually his solution allows many more optimizations... – Evgeny Kluev Aug 25 '13 at 19:30
  • So yes, my bitwise solution is applicable to the linked question, but it would not be better than the accepted one (it may be faster up to the range 10^6, then on range=10^7, most likely, it would be slower). Also I don't see any justification to post almost identical solution once more. – Evgeny Kluev Aug 25 '13 at 19:31

Algorithm

  • Main loop traverses the list
  • If the current number is found in the precalculated list, then it belongs to all the sequences in that list; recalculate all those sequences with count + 1
  • Remove all entries precalculated for the current element
  • Recalculate new sequences where the first element ranges from 0 to the current index and the second is the current element of the traversal (actually, not from 0 to current; we can use the fact that a new element shouldn't be more than max(a) and that a new list should have the possibility to become longer than the one already found)

So for the list [1, 2, 4, 5, 7] the output would be (it's a little messy, try the code yourself and see)

  • index 0, element 1:
    • if 1 in precalc? No - do nothing
    • Do nothing
  • index 1, element 2:
    • if 2 in precalc? No - do nothing
    • check if 3 = 1 + (2 - 1) * 2 in our set? No - do nothing
  • index 2, element 4:
    • if 4 in precalc? No - do nothing
      • check if 6 = 2 + (4 - 2) * 2 in our set? No
      • check if 7 = 1 + (4 - 1) * 2 is in our set? Yes - add a new element {7: {3: {'count': 2, 'start': 1}}}: 7 is an element of the list, 3 is the step.
  • index 3, element 5:
    • if 5 in precalc? No - do nothing
      • do not check 4 because 6 = 4 + (5 - 4) * 2 is less than the already-calculated element 7
      • check if 8 = 2 + (5 - 2) * 2 in our set? No
      • check 9 = 1 + (5 - 1) * 2 - more than max(a) == 7
  • index 4, element 7:
    • if 7 in precalc? Yes - put it into result
      • do not check 5 because 9 = 5 + (7 - 5) * 2 is more than max(a) == 7

result = (3, {'count': 3, 'start': 1}) # step 3, count 3, start 1, turn it into sequence

Complexity

It shouldn't be more than O(N^2), and I think it's less because of early termination of the search for new sequences; I'll try to provide a detailed analysis later

Code

def add_precalc(precalc, start, step, count, res, N):
    if step == 0: return True
    if start + step * res[1]["count"] > N: return False

    x = start + step * count
    if x > N or x < 0: return False

    if precalc[x] is None: return True

    if step not in precalc[x]:
        precalc[x][step] = {"start":start, "count":count}

    return True

def work(a):
    precalc = [None] * (max(a) + 1)
    for x in a: precalc[x] = {}
    N, m = max(a), 0
    ind = {x:i for i, x in enumerate(a)}

    res = (0, {"start":0, "count":0})
    for i, x in enumerate(a):
        for el in precalc[x].iteritems():
            el[1]["count"] += 1
            if el[1]["count"] > res[1]["count"]: res = el
            add_precalc(precalc, el[1]["start"], el[0], el[1]["count"], res, N)
            t = el[1]["start"] + el[0] * el[1]["count"]
            if t in ind and ind[t] > m:
                m = ind[t]
        precalc[x] = None

        for y in a[i - m - 1::-1]:
            if not add_precalc(precalc, y, x - y, 2, res, N): break

    return [x * res[0] + res[1]["start"] for x in range(res[1]["count"])]
Roman Pekar
  • 1
    it search for difference that appeared most of the times, right? it will find a lot of 1's in "1 2 5 6 100 101 1000 1001 1e5 1e5+2 1e5+4", while the best answer is with diff=2? – RiaD Aug 10 '13 at 10:17
  • actually my solution is not memory efficient, changing it anyway :) – Roman Pekar Aug 10 '13 at 11:04
  • Did you change anything? I didn't notice the difference – RiaD Aug 10 '13 at 12:05
  • not yet, have to go do some sports; my mind actually works better when I'm running, so maybe this will help :) If I find a solution I'll put it here. My current one is very naive, thanks for noticing – Roman Pekar Aug 10 '13 at 12:13
  • @RiaD - added an algorithm. It's still quadratic (I think); I ran it on 10000 points and it was ok, but on 100000 it already takes too long. So it will not work blazingly fast on 1M rows. But I think Armin Rigo's algorithm couldn't manage that much either. I'll try more; still trying to understand if O(N) is possible, using the fact that the list is sorted – Roman Pekar Aug 11 '13 at 07:05
  • 3
    In the future, I highly recommend not deleting and re-posting an answer if it is accidentally labeled as Community Wiki. We can easily reverse that status, as I've done here. – Brad Larson Aug 11 '13 at 15:13

Here is another answer, working in time O(n^2) and without any notable memory requirements beyond that of turning the list into a set.

The idea is quite naive: like the original poster's method, it is greedy and just checks how far you can extend a subsequence from each pair of points --- however, checking first that we're at the start of a subsequence. In other words, from points a and b you check how far you can extend to b + (b-a), b + 2*(b-a), ... but only if a - (b-a) is not already in the set of all points. If it is, then you already saw the same subsequence.

The trick is to convince ourselves that this simple optimization is enough to lower the complexity to O(n^2) from the original O(n^3). That's left as an exercise to the reader :-) The time is competitive with other O(n^2) solutions here.

A = [1, 4, 5, 7, 8, 12]    # in sorted order
Aset = set(A)

lmax = 2
for j, b in enumerate(A):
    for i in range(j):
        a = A[i]
        step = b - a
        if b + step in Aset and a - step not in Aset:
            c = b + step
            count = 3
            while c + step in Aset:
                c += step
                count += 1
            #print "found %d items in %d .. %d" % (count, a, c)
            if count > lmax:
                lmax = count

print lmax
Armin Rigo

Your solution is O(N^3) now (you said O(N^2) per index). Here is an O(N^2) time and O(N^2) memory solution.

Idea

If we know a subsequence that goes through indices i[0], i[1], i[2], i[3], we shouldn't try a subsequence that starts with i[1] and i[2], or with i[2] and i[3]

Note: I edited the code to make it a bit simpler, using the fact that a is sorted, but it will not work for equal elements. You may check the maximum number of equal elements in O(N) easily.

Pseudocode

I'm only looking for the max length, but that doesn't change anything

whereInA = {}
for i in range(n):
   whereInA[a[i]] = i; // It doesn't matter which of the equal elements it points to

boolean usedPairs[n][n];

for i in range(n):
    for j in range(i + 1, n):
        if usedPairs[i][j]:
            continue; // do not do anything. It was in one of the previous sequences.

        usedPairs[i][j] = true;

        // here quite a naive solution:
        diff = a[j] - a[i];
        if diff == 0:
            continue; // we can't work with that
        lastIndex = j
        currentLen = 2
        while whereInA contains a[lastIndex] + diff:
            nextIndex = whereInA[a[lastIndex] + diff]
            usedPairs[lastIndex][nextIndex] = true
            ++currentLen
            lastIndex = nextIndex

        // you may store all indices here
        maxLen = max(maxLen, currentLen)
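
A direct Python transcription of the pseudocode, for reference (a sketch; it returns only the maximum length and assumes distinct elements):

def longest_equally_spaced(a):
    n = len(a)
    whereInA = {x: i for i, x in enumerate(a)}
    usedPairs = [[False] * n for _ in xrange(n)]
    maxLen = min(n, 2)
    for i in xrange(n):
        for j in xrange(i + 1, n):
            if usedPairs[i][j]:
                continue                  # covered by an earlier sequence
            usedPairs[i][j] = True
            diff = a[j] - a[i]
            if diff == 0:
                continue
            lastIndex, currentLen = j, 2
            while a[lastIndex] + diff in whereInA:
                nextIndex = whereInA[a[lastIndex] + diff]
                usedPairs[lastIndex][nextIndex] = True
                currentLen += 1
                lastIndex = nextIndex
            maxLen = max(maxLen, currentLen)
    return maxLen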

Thoughts about memory usage

O(n^2) time is very slow for 1000000 elements. But if you are going to run this code on that many elements, the biggest problem will be memory usage.
What can be done to reduce it?

  • Change the boolean arrays to bit fields to store more than one boolean per byte.
  • Make each successive boolean array shorter, because we only use usedPairs[i][j] if i < j

A few heuristics:

  • Store only pairs of used indices. (Conflicts with the first idea)
  • Remove usedPairs entries that will never be used again (those for i, j that have already been passed in the loop)
RiaD

This is my 2 cents.

If you have a list called input:

input = [1, 4, 5, 7, 8, 12]

You can build a data structure that, for each one of these points (excluding the first one), tells you how far that point is from each of its predecessors:

[1, 4, 5, 7, 8, 12]
 x  3  4  6  7  11   # distance from point i to point 0
 x  x  1  3  4   8   # distance from point i to point 1
 x  x  x  2  3   7   # distance from point i to point 2
 x  x  x  x  1   5   # distance from point i to point 3
 x  x  x  x  x   4   # distance from point i to point 4

Now that you have the columns, you can consider the i-th item of input (which is input[i]) and each number n in its column.

The numbers that belong to a series of equidistant numbers including input[i] are those that have n * j in the i-th position of their column (where j is the number of matches already found when moving through the columns from left to right), plus the k-th predecessor of input[i] (where k is the index of n in the column of input[i]).

Example: if we consider i = 1, input[i] = 4, n = 3, then we can identify a sequence comprising 4 (input[i]), 7 (because it has a 3 in position 1 of its column) and 1 (because k is 0, so we take the first predecessor of input[i]).

Possible implementation (sorry if the code is not using the same notation as the explanation):

def build_columns(l):
    # note: uses list.index, so it assumes the input has no duplicates
    columns = {}
    for x in l[1:]:
        col = []
        for y in l[:l.index(x)]:
            col.append(x - y)
        columns[x] = col
    return columns

def algo(input, columns):
    seqs = []
    for index1, number in enumerate(input[1:]):
        index1 += 1 #first item was sliced
        for index2, distance in enumerate(columns[number]):
            seq = []
            seq.append(input[index2]) # k-th pred
            seq.append(number)
            matches = 1
            for successor in input[index1 + 1 :]:
                column = columns[successor]
                if column[index1] == distance * matches:
                    matches += 1
                    seq.append(successor)
            if (len(seq) > 2):
                seqs.append(seq)
    return seqs

The longest one:

print max(algo(input, build_columns(input)), key=len)
Vincenzo Pii
  • Your algorithm took 2 seconds on 300 points; my algorithm took 0.03 seconds. I've tried yours on 5000 but it took too long :( mine took 18 seconds on 5000, so there's still no way we could do this fast for 1000000 – Roman Pekar Aug 11 '13 at 07:15
  • yes, but maybe we're missing some point and it's possible to do it linearly :) or NlogN at least.... – Roman Pekar Aug 11 '13 at 07:42

Traverse the array, keeping a record of the optimal result(s) and a table with

(1) index - the element difference in the sequence,
(2) count - number of elements in the sequence so far, and
(3) the last recorded element.

For each array element, look at the difference from each previous array element; if that element is the last in a sequence indexed in the table, adjust that sequence in the table and update the best sequence if applicable; otherwise start a new sequence, unless the current max is already greater than the length of the possible sequence.

Scanning backwards, we can stop the scan when d is greater than the middle of the array's range, or (once d is greater than the largest indexed difference) when the current max is greater than the length of the possible sequence. Sequences where s[j] is greater than the last element in the sequence are deleted.

I converted my code from JavaScript to Python (my first python code):

import random
import timeit
import sys

#s = [1,4,5,7,8,12]
#s = [2, 6, 7, 10, 13, 14, 17, 18, 21, 22, 23, 25, 28, 32, 39, 40, 41, 44, 45, 46, 49, 50, 51, 52, 53, 63, 66, 67, 68, 69, 71, 72, 74, 75, 76, 79, 80, 82, 86, 95, 97, 101, 110, 111, 112, 114, 115, 120, 124, 125, 129, 131, 132, 136, 137, 138, 139, 140, 144, 145, 147, 151, 153, 157, 159, 161, 163, 165, 169, 172, 173, 175, 178, 179, 182, 185, 186, 188, 195]
#s = [0, 6, 7, 10, 11, 12, 16, 18, 19]

m = [random.randint(1,40000) for r in xrange(20000)]
s = list(set(m))
s.sort()

lenS = len(s)
halfRange = (s[lenS-1] - s[0]) // 2

while s[lenS-1] - s[lenS-2] > halfRange:
    s.pop()
    lenS -= 1
    halfRange = (s[lenS-1] - s[0]) // 2

while s[1] - s[0] > halfRange:
    s.pop(0)
    lenS -=1
    halfRange = (s[lenS-1] - s[0]) // 2

n = lenS

largest = (s[n-1] - s[0]) // 2
#largest = 1000 #set the maximum size of d searched

maxS = s[n-1]
maxD = 0
maxSeq = 0
hCount = [None]*(largest + 1)
hLast = [None]*(largest + 1)
best = {}

start = timeit.default_timer()

for i in range(1,n):

    sys.stdout.write(repr(i)+"\r")

    for j in range(i-1,-1,-1):
        d = s[i] - s[j]
        numLeft = n - i
        if d != 0:
            maxPossible = (maxS - s[i]) // d + 2
        else:
            maxPossible = numLeft + 2
        ok = numLeft + 2 > maxSeq and maxPossible > maxSeq

        if d > largest or (d > maxD and not ok):
            break

        if hLast[d] != None:
            found = False
            for k in range (len(hLast[d])-1,-1,-1):
                tmpLast = hLast[d][k]
                if tmpLast == j:
                    found = True
                    hLast[d][k] = i
                    hCount[d][k] += 1
                    tmpCount = hCount[d][k]
                    if tmpCount > maxSeq:
                        maxSeq = tmpCount
                        best = {'len': tmpCount, 'd': d, 'last': i}
                elif s[tmpLast] < s[j]:
                    del hLast[d][k]
                    del hCount[d][k]
            if not found and ok:
                hLast[d].append(i)
                hCount[d].append(2)
        elif ok:
            if d > maxD: 
                maxD = d
            hLast[d] = [i]
            hCount[d] = [2]


end = timeit.default_timer()
seconds = (end - start)

#print (hCount)
#print (hLast)
print(best)
print(seconds)
גלעד ברקן
  • so you still need O(N^2) memory – Roman Pekar Aug 10 '13 at 16:26
  • @RomanPekar Thank you for your comment. What about the added optimization that terminates the backward scan early? – גלעד ברקן Aug 10 '13 at 17:10
  • @RomanPeckar no, just thinking about the algorithm; also I don't know Python. – גלעד ברקן Aug 10 '13 at 18:22
  • @RomanPeckar added jsfiddle example to my answer. http://jsfiddle.net/groovy/b6zkR/ – גלעד ברקן Aug 10 '13 at 21:48
  • well I think there are many O(N^2) solutions; I've coded an example like this today. I still don't know if there's a good solution that could run for 1000000 numbers... – Roman Pekar Aug 10 '13 at 21:51
  • @RomanPekar Can I ask you, when you say O(N^2) memory in your comment on my answer, do you mean that at some point during the algorithm, N^2 of information needs to be stored in memory; or do you mean that in total, by the time the algorithm is finished, N^2 memory will have been used. I am asking because my hash idea, although it runs slowly, because the table is constantly being emptied, will use much much less memory than N^2 at any given moment. Actually, if you only want one optimal answer, the table can run in pretty much constant space. – גלעד ברקן Aug 12 '13 at 10:15
  • I mean N^2 at some point. Take a look at my answer, I think I've used similar idea to your "hashing", but I've made a precalculation of next elements in sequence. I'll try to do some speed testing today. – Roman Pekar Aug 12 '13 at 10:19
  • I tried two other ideas (all in the world's slowest language...:); one was a simple loop with one `bestSequence` variable, the other was a DP-style N-long array with the best sequences so far (best up to s[i], best up to s[i+1], etc.) that would be emptied as well. But they were both slower than this hash. I like your precalculation idea, although I need to learn more python first to read it. – גלעד ברקן Aug 12 '13 at 10:27
  • Python is not *so* bad with pypy :) – Simd Aug 12 '13 at 18:25
  • @user2179021 oh, by 'the world's slowest language' I meant JavaScript :) – גלעד ברקן Aug 12 '13 at 18:32

This is a particular case of the more generic problem described here: Discover long patterns, where K=1 and is fixed. It is demonstrated there that it can be solved in O(N^2). Running my implementation of the C algorithm proposed there, it takes 3 seconds to find the solution for N=20000 and M=28000 on my 32-bit machine.

jbaylina

Greedy method

1. Only one sequence of decisions is generated.
2. It does not guarantee to always give an optimal solution.

Dynamic programming

1. Many sequences of decisions are generated.
2. It definitely gives an optimal solution.