I recently failed a coding challenge which deals with time complexity. I've been tinkering with it in my spare time but still can't get it to work quickly for large lists. I had initially over-thought the problem, refactored it, etc., made some incremental improvements, tried using pandas
(which turned out to be much slower), etc.
I am interested in learning what methods I might use to improve the execution speed of this code.
Input: a list with maximum size
10**6
containing unsorted integers inrange(1,10**5)
.The task is to compute the "total price" from this arbitrary construct and return the "total price" and an ordered list of indices representing those items that were not discounted.
An item at index
i
has its price discounted by the next smaller/lower item. If there are no smaller values initems[i+1:]
, the item's price is not discounted (or you can consider it discounted by0
).Example Input:
items = [5, 3, 4, 1, 5]
Example Output:
13, [3, 4]
Here,
items[0]
is discounted byitems[1]
,items[1]
is discounted byitems[3]
,items[2]
is discounted byitems[3]
,items[3]
anditems[4]
are not discounted.So the total price is
13
, given by(5-3) + (3-1) + (4-1) + (1-0) + (5-0)
I have a function that solves this pretty quickly for most cases, but as we start to approach the maximum size of the list, it's taking much longer. For example, a list of length 50000 is processed in < 1 second. A list of length 100K is processed in <3 seconds. A list of length 200K takes <10 seconds, and 400K takes about 50 seconds. Running against a million items takes ~1000+ seconds.
For testing, I create a large list like so and then I pass it (or slices of it) to the functions, like:
data = list(np.array(np.random.randint(1,10**5,(10**6)), dtype='int64'))
total, full_price = get_total(data[:100000])
Here is the faster, non-pandas
function:
def get_total(data):
init_total = sum(data)
items = data[:]
size = len(items)
discount = [get_discount(items.pop(0),items) for i in range(size)]
full = [i for (i,v) in enumerate(discount) if v == 0]
total = init_total - sum(discount)
return total, full, None
def get_discount(this, _items):
next_lowest_index, discount = next(((x,val) for x, val in enumerate(_items) if val < this), (np.NaN, 0))
return discount
I mentioned I had tried pandas
as well, but this code is much slower even on small lists (n=1000). I tried sorting it by value:
def frame_total(data):
if type(data) == list:
data = pd.DataFrame(data)
data = data[:].sort_values(0, 'index')
df = pd.DataFrame({ 'val':data[0],
'discount': [0] * data.shape[0]
}, dtype='int')
df.discount = [next(iter(df.loc[(df.index > i) & (df.val < row.val)].sort_index().val),0)
for i,row in df.iterrows()]
total = data.sum() - df.discount.sum()
full_indices = list(df[df.discount == 0].sort_index().index)
return total, full_indices, None
And another that does not sort the input data which is not perceptibly faster:
def frame2(data):
if type(data) == list:
data = pd.DataFrame(data)
data = data[:]
df = pd.DataFrame({ 'val':data[0],
'discount': [0] * data.shape[0]
}, dtype='int')
df.discount = [next(iter(df.val[i+1:].loc[df.val < row.val]),0) for i,row in df.iterrows()]
total = data.sum() - df.discount.sum()
full_indices = list(df[df.discount == 0].index)
return total, full_indices, None
Note that the full-price items are more likely to exist towards the end of the list (as i
increases, probability that any value < items[i]
exists in items[i+1:]
decreases). I feel like this is important, but I can't grok how to make use of that.
Solved, thanks @DarrylG and to the explanation here
def get_next_smallest(data,default=0):
"""
returns the discounted value for all items in a list
discounted value is the next smaller item in the list, e.g.:
for any n, the next smallest item is the first item in data[n+1:] < data[n]
provides O(n) complexity solution.
"""
discounts=[default for i in data] # stores the corresponding next smaller value
stack = [] # initialize our empty stack
for i, this in enumerate(data):
while len(stack) > 0 and this < data[stack[-1]]:
discounts[stack.pop()] = this
stack.append(i)
return discounts
def get_total(data):
init_total = sum(data)
default = 0 # should be a value that will NOT be present in the data, like 0 or -1
discounts = get_next_smallest(data, default)
full = [i for i,v in enumerate(discounts) if v == default]
total = init_total - sum(discounts)
return total, full