Finding the subsequence that starts and ends with the same number with the maximum sum

Question

I have to make a program that takes as input a list of numbers and returns the sum of the subsequence that starts and ends with the same number which has the maximum sum (including the equal numbers in the beginning and end of the subsequence in the sum). It also has to return the placement of the start and end of the subsequence, that is, their index+1. The problem is that my current code runs smoothly only while the length of list is not that long. When the list length extends to 5000 the program does not give an answer.

The input is the following:

6
3 2 4 3 5 6

The first line is for the length of the list. The second line is the list itself, with list items separated by space. The output will be 12, 1, 4, because as you can see there is 1 equal pair of numbers (3): the first and fourth element, so the sum of elements between them is 3 + 2 + 4 + 3 = 12, and their placement is first and fourth.

Here is my code.

length = int(input())
mass = raw_input().split()
for i in range(length):
    mass[i]=int(mass[i])
value=-10000000000

b = 1
e = 1
for i in range(len(mass)):
    if mass[i:].count(mass[i])!=1:
      for j in range(i,len(mass)):
        if mass[j]==mass[i]:
            f = mass[i:j+1]
            if sum(f)>value:
                value = sum(f)
                b = i+1
                e = j+1
    else:
        if mass[i]>value:
            value = mass[i]
            b = i+1
            e = i+1
print value
print b,e

Sorry, I do not understand what your program is supposed to do. Can you clarify your description, for example by showing it for a simple toy input? (You can [edit] your post to that effect) — akraf, Jan 02 '18 at 19:24
So for that sample input of `3 5 6 3 5 4` your code should return 19 = 5+6+3+5, and also (2, 5), because the start and end of that sequence is at (1, 4). Is that correct? — PM 2Ring, Jan 02 '18 at 19:28
Your algorithm has a "big-O notation" of O(n^2) which means bigger input will be exponentially slower. If this is a beginning programming course, that's probably the point they're trying to teach you. Look for a different approach to the problem that doesn't involve doubly nested for loops. — dwilliss, Jan 02 '18 at 20:44

PM 2Ring · Accepted Answer · 2018-01-04T17:54:36.220

This should be faster than your current approach.

Rather than searching through mass looking for pairs of matching numbers we pair each number in mass with its index and sort those pairs. We can then use groupby to find groups of equal numbers. If there are more than 2 of the same number we use the first and last, since they will have the greatest sum between them.

from operator import itemgetter
from itertools import groupby

raw = '3 5 6 3 5 4'
mass = [int(u) for u in raw.split()]

result = []
a = sorted((u, i) for i, u in enumerate(mass))
for _, g in groupby(a, itemgetter(0)):
    g = list(g)
    if len(g) > 1:
        u, v = g[0][1], g[-1][1]
        result.append((sum(mass[u:v+1]), u+1, v+1))
print(max(result))

output

(19, 2, 5)

Note that this code will not necessarily give the maximum sum between equal elements in the list if the list contains negative numbers. It will still work correctly with negative numbers if no group of equal numbers has more than two members. If that's not the case, we need to use a slower algorithm that tests every pair within a group of equal numbers.

Here's a more efficient version. Instead of using the sum function we build a list of the cumulative sums of the whole list. This doesn't make much of a difference for small lists, but it's much faster when the list size is large. Eg, for a list of 10,000 elements this approach is about 10 times faster. To test it, I create an array of random positive integers.

from operator import itemgetter
from itertools import groupby
from random import seed, randrange

seed(42)

def maxsum(seq):
    total = 0
    sums = [0]
    for u in seq:
        total += u
        sums.append(total)
    result = []
    a = sorted((u, i) for i, u in enumerate(seq))
    for _, g in groupby(a, itemgetter(0)):
        g = list(g)
        if len(g) > 1:
            u, v = g[0][1], g[-1][1]
            result.append((sums[v+1] - sums[u], u+1, v+1))
    return max(result)

num = 25000
hi = num // 2
mass = [randrange(1, hi) for _ in range(num)]
print(maxsum(mass))

output

(155821402, 21, 24831)

If you're using a recent version of Python you can use itertools.accumulate to build the list of cumulative sums. This is around 10% faster.

from itertools import accumulate

def maxsum(seq):
    sums = [0] + list(accumulate(seq))
    result = []
    a = sorted((u, i) for i, u in enumerate(seq))
    for _, g in groupby(a, itemgetter(0)):
        g = list(g)
        if len(g) > 1:
            u, v = g[0][1], g[-1][1]
            result.append((sums[v+1] - sums[u], u+1, v+1))
    return max(result)

Here's a faster version, derived from code by Stefan Pochmann, which uses a dict, instead of sorting & groupby. Thanks, Stefan!

def maxsum(seq):
    total = 0
    sums = [0]
    for u in seq:
        total += u
        sums.append(total)
    where = {}
    for i, x in enumerate(seq, 1):
        where.setdefault(x, [i, i])[1] = i
    return max((sums[j] - sums[i - 1], i, j)
        for i, j in where.values())

If the list contains no duplicate items (and hence no subsequences bound by duplicate items) it returns the maximum item in the list.

Here are two more variations. These can handle negative items correctly, and if there are no duplicate items they return None. In Python 3 that could be handled elegantly by passing default=None to max, but that option isn't available in Python 2, so instead I catch the ValueError exception that's raised when you attempt to find the max of an empty iterable.

The first version, maxsum_combo, uses itertools.combinations to generate all combinations of a group of equal numbers and thence finds the combination that gives the maximum sum. The second version, maxsum_kadane uses a variation of Kadane's algorithm to find the maximum subsequence within a group.

If there aren't many duplicates in the original sequence, so the average group size is small, maxsum_combo is generally faster. But if the groups are large, then maxsum_kadane is much faster than maxsum_combo. The code below tests these functions on random sequences of 15000 items, firstly on sequences with few duplicates (and hence small mean group size) and then on sequences with lots of duplicates. It verifies that both versions give the same results, and then it performs timeit tests.

from __future__ import print_function
from itertools import groupby, combinations
from random import seed, randrange
from timeit import Timer

seed(42)

def maxsum_combo(seq):
    total = 0
    sums = [0]
    for u in seq:
        total += u
        sums.append(total)
    where = {}
    for i, x in enumerate(seq, 1):
        where.setdefault(x, []).append(i)
    try:
        return max((sums[j] - sums[i - 1], i, j)
            for v in where.values() for i, j in combinations(v, 2))
    except ValueError:
        return None

def maxsum_kadane(seq):
    total = 0
    sums = [0]
    for u in seq:
        total += u
        sums.append(total)
    where = {}
    for i, x in enumerate(seq, 1):
        where.setdefault(x, []).append(i)
    try:
        return max(max_sublist([(sums[j] - sums[i-1], i, j)
            for i, j in zip(v, v[1:])], k) 
                for k, v in where.items() if len(v) > 1)
    except ValueError:
        return None

# Kadane's Algorithm to find maximum sublist
# From https://en.wikipedia.org/wiki/Maximum_subarray_problem
def max_sublist(seq, k):
    max_ending_here = max_so_far = seq[0]
    for x in seq[1:]:
        y = max_ending_here[0] + x[0] - k, max_ending_here[1], x[2]
        max_ending_here = max(x, y)
        max_so_far = max(max_so_far, max_ending_here)
    return max_so_far

def test(num, hi, loops):
    print('\nnum = {0}, hi = {1}, loops = {2}'.format(num, hi, loops))
    print('Verifying...')
    for k in range(5):
        mass = [randrange(-hi // 2, hi) for _ in range(num)]
        a = maxsum_combo(mass)
        b = maxsum_kadane(mass)
        print(a, b, a==b)

    print('\nTiming...')
    for func in maxsum_combo, maxsum_kadane:
        t = Timer(lambda: func(mass))
        result = sorted(t.repeat(3, loops))
        result = ', '.join([format(u, '.5f') for u in result])
        print('{0:14} : {1}'.format(func.__name__, result))

loops = 20
num = 15000
hi = num // 4
test(num, hi, loops)

loops = 10
hi = num // 100
test(num, hi, loops)

output

num = 15000, hi = 3750, loops = 20
Verifying...
(13983131, 44, 14940) (13983131, 44, 14940) True
(13928837, 27, 14985) (13928837, 27, 14985) True
(14057416, 40, 14995) (14057416, 40, 14995) True
(13997395, 65, 14996) (13997395, 65, 14996) True
(14050007, 12, 14972) (14050007, 12, 14972) True

Timing...
maxsum_combo   : 1.72903, 1.73780, 1.81138
maxsum_kadane  : 2.17738, 2.22108, 2.22394

num = 15000, hi = 150, loops = 10
Verifying...
(553789, 21, 14996) (553789, 21, 14996) True
(550174, 1, 14992) (550174, 1, 14992) True
(551017, 13, 14991) (551017, 13, 14991) True
(554317, 2, 14986) (554317, 2, 14986) True
(558663, 15, 14988) (558663, 15, 14988) True

Timing...
maxsum_combo   : 7.29226, 7.34213, 7.36688
maxsum_kadane  : 1.07532, 1.07695, 1.10525

This code runs on both Python 2 and Python 3. The above results were generated on an old 32 bit 2GHz machine running Python 2.6.6 on a Debian derivative of Linux. The speeds for Python 3.6.0 are similar.

If you want to include groups that consist of a single non-repeated number, and also want to include the numbers that are in groups as a "subsequence" of length 1, you can use this version:

def maxsum_kadane(seq):
    if not seq:
        return None
    total = 0
    sums = [0]
    for u in seq:
        total += u
        sums.append(total)
    where = {}
    for i, x in enumerate(seq, 1):
        where.setdefault(x, []).append(i)
    # Find the maximum of the single items
    m_single = max((k, v[0], v[0]) for k, v in where.items())
    # Find the maximum of the subsequences
    try:
        m_subseq = max(max_sublist([(sums[j] - sums[i-1], i, j)
                for i, j in zip(v, v[1:])], k) 
                    for k, v in where.items() if len(v) > 1)
        return max(m_single, m_subseq)
    except ValueError:
        # No subsequences
        return m_single

I haven't tested it extensively, but it should work. ;)

@StefanPochmann Maybe after I wake up. :) Sure, putting `sum` inside that loop makes it O(n²), but it's very fast compared to manually summing items, so the run time shouldn't be too bad unless the list size is large. FWIW, I did a bunch of timeit tests recently where `sum`-based O(n²) algorithms were able to out-perform O(nlogn) algorithms for n<500. See https://stackoverflow.com/a/47845960/4014959 — PM 2Ring, Jan 02 '18 at 21:15
Woah, that's a long one. Maybe I'll read it after I wake up :D. Yes, I'm well aware that fast O(n) can beat slow O(log n), I've used that quite a few times myself. But your fast O(n) won't beat O(**1**) that easily. — Stefan Pochmann, Jan 02 '18 at 21:20
@PM2Ring first, thank for your effort. I tried this code on an input list of 1 2 1 2 3, but it returns 4, 1, 3 while the answer should be 5,2,4 — Kang MinSon, Jan 02 '18 at 21:41
@KangMinSon That's odd. I just ran my code with `raw = '1 2 1 2 3'` and it prints `(5, 2, 4)`. — PM 2Ring, Jan 03 '18 at 08:09
@StefanPochmann I've eliminated `sum`, but I don't know how to make it linear. I guess it's now O(nlogn), which is the worst-case running time for TimSort, since all the other steps are O(n), assuming that `groupby` is linear. — PM 2Ring, Jan 03 '18 at 09:00
Well you're doing the sorting only for the grouping, right? You could group with a dictionary instead, surely you've done that many times before :-). Here's my version: https://ideone.com/YeMdaN — Stefan Pochmann, Jan 03 '18 at 09:59
@StefanPochmann Ah, of course! I guess I got "locked in" to using `groupby`, but a `dict` makes a lot more sense here. You should post an answer. — PM 2Ring, Jan 03 '18 at 10:17
Mmh, the OP mentioned negative numbers under Prune's answer, so I wouldn't want to post an answer now ignoring that but I also wouldn't want to cover it if it's not in the question... — Stefan Pochmann, Jan 03 '18 at 10:57
@StefanPochmann Fair enough. Handling negatives efficiently is tricky. I suppose I could look at all combinations in each group. That shouldn't be too slow if the group size is smallish. — PM 2Ring, Jan 03 '18 at 11:04
You mean quadratic for each group? Nah, I think it's still easy to do each group and thus the whole problem in linear time. — Stefan Pochmann, Jan 03 '18 at 11:48
@StefanPochmann Ok, I finally figured out how to do each group in linear time. Thanks for the encouragement. — PM 2Ring, Jan 03 '18 at 16:04
@PM2Ring I didn't really want to bother you anymore with this question, but still, how this code can be modified, so that single element is also counted as the start and the end of the subarray. — Kang MinSon, Jan 04 '18 at 13:24
@KangMinSon It would have been better if you'd mentioned that option in the original question... But anyway, please see my update. I'll let you figure out how to modify `maxsum_combo`. ;) — PM 2Ring, Jan 04 '18 at 14:18
@PM2Ring I have tried test input of list [-1, -1, -1], but the program returns -2, 2, 3, while it should return -1, 1, 1 or -1,2,2 or -1,3,3 — Kang MinSon, Jan 04 '18 at 17:07
@KangMinSon I see. I didn't realise that you wanted to add single elements that are also parts of repeated groups. Let me think about it, but I may not post a solution for a while, it's 4:22AM in my timezone. — PM 2Ring, Jan 04 '18 at 17:23
@PM2Ring Oh, I am so grateful for your help. While you sleep i'll try to find the solution by myself; though, you did 99% of the work, it is my task after all) — Kang MinSon, Jan 04 '18 at 17:35
@PM2Ring ... I strongly recommend you to sleep at 4 AM... But anyway THANK YOU!! — Kang MinSon, Jan 04 '18 at 18:12

Finding the subsequence that starts and ends with the same number with the maximum sum

1 Answers1