
The request from HackerRank:

If the amount spent by a client on a particular day is greater than or equal to 2× the client's median spending for a trailing number of days, they send the client a notification about potential fraud. The bank doesn't send the client any notifications until they have at least that trailing number of prior days' transaction data.

Given the number of trailing days d and a client's total daily expenditures for a period of n days, determine the number of times the client will receive a notification over all n days.

My code solves the problem, but there is a time limit, and for big test cases my code cannot meet it. My code is actually short:

from statistics import median

first_multiple_input = input().rstrip().split()
n = int(first_multiple_input[0])
d = int(first_multiple_input[1])
expenditure = list(map(int, input().rstrip().split()))
count = 0
for i in range(len(expenditure) - d):
    if expenditure[d + i] >= 2 * median(expenditure[i:d + i]):
        count += 1
print(count)

Please point out the reasons that cause the delay and how to improve them.

Small test case to help understand the code:

9 5                 expenditure[] size n =9, d = 5
2 3 4 2 3 6 8 4 5   expenditure = [2, 3, 4, 2, 3, 6, 8, 4, 5]
  • It's more of an algorithmic problem than an execution-time problem. I think a dynamic programming approach can help you, though I might be wrong. Eventually you will need to somehow do fewer iterations. Also the median function might be slow. What is the time limit, and at what time does it fail? – Wahalez Sep 22 '21 at 09:37
  • @Wahalez, hi, it requires 10 seconds. I have checked: for a small test case it is '10.201635599136353' seconds; is it fine? Big test cases have 200000 expenditures and 10122 trailing days, for example. I am looking for reasons through Google. I also suspect the median function is the problem. – Zhengxi Jiang Sep 22 '21 at 09:40
  • 3
    You should profile your script and see where it is spending most of its time. See [How can you profile a Python script?](https://stackoverflow.com/questions/582336/how-can-you-profile-a-python-script). I'm guessing the time-consuming part is probably the `for`-loop and the repeated calculation of the median. If you look at the [source code](https://github.com/python/cpython/blob/3.9/Lib/statistics.py#L414) of the function you can see it sorts the data everytime it's called, which is relatively expensive — so avoiding that step somehow would be a good tactic. – martineau Sep 22 '21 at 09:42
  • Thank you @martineau, at least I know the reason; now working on a solution. This is a new kind of problem for me. – Zhengxi Jiang Sep 22 '21 at 09:44
  • Essentially the median of a "sliding window" of the data is being calculated. For a really large dataset, it might be worthwhile to use a fixed-size [`collections.deque`](https://docs.python.org/3/library/collections.html#collections.deque) for this "window" because it's optimized for adding and removing elements from either of its ends. – martineau Sep 22 '21 at 09:57

1 Answer


Analysis / Idea

Your median(expenditure[i:d+i]) is the culprit: it sorts the entire unsorted slice of size d every time it's called, which takes O(d log d) time. You can reduce the per-step cost to O(log d) by keeping the current window of trailing elements in, for example, a SortedList. You read the median from the middle element (or two), and to update, you just add the one new element and remove the oldest one.

Implementation

from sortedcontainers import SortedList

n = 9
d = 5
expenditure = [2, 3, 4, 2, 3, 6, 8, 4, 5]

count = 0
trailing = SortedList(expenditure[:d])
half = d // 2
for i in range(d, n):
    median = (trailing[half] + trailing[~half]) / 2
    if expenditure[i] >= 2 * median:
        count += 1
    trailing.add(expenditure[i])
    trailing.remove(expenditure[i - d])
print(count)

We could leave out the / 2 and the 2 *, but then "median" would be the wrong name, and naming things is hard. We could do if expenditure[i] >= trailing[half] + trailing[~half], but I find it less clear.

Output

If you add

    print(f'{trailing=} {median=} {expenditure[i]=}')

after the median = ... line, you can see what's going on:

trailing=SortedList([2, 2, 3, 3, 4]) median=3.0 expenditure[i]=6
trailing=SortedList([2, 3, 3, 4, 6]) median=3.0 expenditure[i]=8
trailing=SortedList([2, 3, 4, 6, 8]) median=4.0 expenditure[i]=4
trailing=SortedList([2, 3, 4, 6, 8]) median=4.0 expenditure[i]=5
2

Alternative implementation

Using zip instead of indexes:

count = 0
trailing = SortedList(expenditure[:d])
half = d // 2
for today, oldest in zip(expenditure[d:], expenditure):
    median = (trailing[half] + trailing[~half]) / 2
    if today >= 2 * median:
        count += 1
    trailing.add(today)
    trailing.remove(oldest)
print(count)

Alternative data structure: sorted regular list

I found the problem at HackerRank, which doesn't have sortedcontainers. But the following gets accepted there.

We can use a regular Python list but keep it sorted ourselves with the help of sorted and bisect included in Python's standard library:

from bisect import bisect_left, insort

count = 0
trailing = sorted(expenditure[:d])
half = d // 2
for today, oldest in zip(expenditure[d:], expenditure):
    median = (trailing[half] + trailing[~half]) / 2
    if today >= 2 * median:
        count += 1
    insort(trailing, today)
    del trailing[bisect_left(trailing, oldest)]
print(count)

This takes O(1) time for accessing the middle element(s), O(log d) for finding the insertion/deletion indexes, and O(d) time for the actual insertion/deletion (since it needs to shift all elements to the right of the index). But that O(d) shifting is very fast, since it happens at a very low level (essentially a memmove in C).

Two more: sorted bytearray and counting sort

The question originally didn't include the reference to HackerRank. Now that I see that the values are limited to integers from 0 to 200, we can also use a bytearray:

trailing = bytearray(sorted(expenditure[:d]))
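
A minimal sketch of that variant (not verbatim from the answer): it is the sorted-regular-list solution with the list swapped for a bytearray, which works because bytearray supports insert(), so insort and del apply unchanged, and because the values fit in one byte. The small test data from the question is assumed here:

```python
from bisect import bisect_left, insort

# Small test case from the question (values must be ints in 0..255 for bytearray)
expenditure = [2, 3, 4, 2, 3, 6, 8, 4, 5]
n, d = len(expenditure), 5

count = 0
trailing = bytearray(sorted(expenditure[:d]))  # sorted window as raw bytes
half = d // 2
for today, oldest in zip(expenditure[d:], expenditure):
    # middle element(s): trailing[half] and trailing[~half] coincide for odd d
    median = (trailing[half] + trailing[~half]) / 2
    if today >= 2 * median:
        count += 1
    insort(trailing, today)                     # bytearray supports insert()
    del trailing[bisect_left(trailing, oldest)] # drop the oldest value
print(count)
```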

And as I just saw pointed out in the discussions there, for this range of allowed values we can also use a form of counting sort. I think a Fenwick tree would make this particularly fast; I might try that later.
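
As a hedged sketch of the counting-sort idea (not code from the answer): keep a 201-slot histogram of the current window and walk it to find the middle element(s), which costs O(201) per step regardless of d. The helper name two_times_median is made up for illustration; it returns twice the median so everything stays in integers:

```python
# Small test case from the question; values are ints in 0..200
expenditure = [2, 3, 4, 2, 3, 6, 8, 4, 5]
n, d = len(expenditure), 5

counts = [0] * 201                # histogram of the current trailing window
for v in expenditure[:d]:
    counts[v] += 1

def two_times_median(counts, d):
    """Return 2 * median of the d values in counts by walking the histogram."""
    lo_pos, hi_pos = (d - 1) // 2, d // 2   # 0-based positions of the middle(s)
    lo = hi = None
    seen = 0
    for value, c in enumerate(counts):
        seen += c
        if lo is None and seen > lo_pos:
            lo = value
        if seen > hi_pos:
            hi = value
            break
    return lo + hi                # equals 2 * median, avoiding floats

count = 0
for today, oldest in zip(expenditure[d:], expenditure):
    if today >= two_times_median(counts, d):
        count += 1
    counts[today] += 1            # slide the window forward
    counts[oldest] -= 1
print(count)
```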

Benchmark

In the comments you mentioned n=200000 and d=10122 as a large case. So I tested with this data:

import random

n = 200000
d = 10122
expenditure = random.choices(range(201), k=n)

Benchmark of my solutions:

                       at replit.com   on my weak laptop
SortedList + indexing   ~1.8 seconds    ~6.4 seconds
SortedList + zipping    ~1.8 seconds    ~6.4 seconds
sorted regular list     ~0.6 seconds    ~8.8 seconds
sorted bytearray        ~0.3 seconds    ~1.7 seconds

Not sure why the regular-list solution was so relatively slow on my laptop. I suspect the working set exceeds my CPU's level-1 cache.

  • Comments are not for extended discussion; this conversation has been [moved to chat](https://chat.stackoverflow.com/rooms/237392/discussion-on-answer-by-dont-talk-just-code-where-can-i-improve-my-code-to-shor). – Samuel Liew Sep 23 '21 at 01:30