1

I have a list and need to find two lists in list a, keeping track of the maximum/minimum, respectively.

Is there a function in some package or numpy that doesn't require loop? I need to speed up my code as my dataset is huge.

a=[4,2,6,5,2,6,9,7,10,1,2,1]
b=[];c=[];
for i in range(len(a)):
    if i==0:
        b.append(a[i])
    elif a[i]>b[-1]:
        b.append(a[i])
for i in range(len(a)):
   if i==0:
       c.append(a[i])
   elif a[i]<c[-1]:
       c.append(a[i])
#The output should be a list :
b=[4,6,9,10];c=[4,2,1] 
Ryszard Czech
  • 18,032
  • 4
  • 24
  • 37
G-09
  • 345
  • 2
  • 13
  • 1
    Didn't quite understand what the two lists are, please clarify. To find maximum in list you can use standard `max()` or `numpy.max` in numpy – Roim Jun 04 '20 at 20:16
  • 1
    The title of your question doesn't match the output that you are looking for, which is a bit confusing. – mapf Jun 04 '20 at 20:17
  • You can use `max()` and slicing to find the maximum of a portion of a list. i.e. `max(a[:5])` will find the maximum of the first five elements of a. – John Gordon Jun 04 '20 at 20:18
  • 1
    The expected output looks like a progressive collection of mins and maxes. For example, `4` is the largest value until you see 6, which is the largest until you see 9, etc. – chepner Jun 04 '20 at 20:19
  • Are you looking for ascending (``b``) and descending (``c``) values inside the initial list (``a``)? – MisterMiyagi Jun 04 '20 at 20:20
  • Why aren't you just keeping the list sorted so you would get the minimum and maximum values just by slicing the start/end of the list? – Gábor Fekete Jun 04 '20 at 20:29
  • How large do you expect ``b`` and ``c`` to be relative to ``a``? Do you expect most items to be discarded or kept? – MisterMiyagi Jun 04 '20 at 20:54
  • Sorry about the confusion. But I think my code is clear – G-09 Jun 04 '20 at 23:17

2 Answers2

3

Since you are saying you are dealing with a very large dataset, and want to avoid using loops, maybe this is a potential solution, which keeps the loops to a minimum:

def while_loop(a):
    b = [a[0]]
    c = [a[0]]
    a = np.array(a[1:])
    while a.size:
        if a[0] > b[-1]:
            b.append(a[0])
        elif a[0] < c[-1]:
            c.append(a[0])
        a = a[(a > b[-1]) | (a < c[-1])]

    return b, c

EDIT:

def for_loop(a):
    b = [a[0]]
    c = [a[0]]
    for x in a[1:]:
        if x > b[-1]:
            b.append(x)
        elif x < c[-1]:
            c.append(x)

    return b, c


print(
    timeit(lambda: while_loop(np.random.randint(0, 10000, 10000)), number=100000)
)  # 27.847886939000002
print(
    timeit(lambda: for_loop(np.random.randint(0, 10000, 10000)), number=100000)
)  # 112.90950811199998

Ok, so I just checked the timing against the regular for loop, and the while loop seems to be about 4-5x faster. No guarantee though, since this strongly seems to depend on the structure of your dataset (see comments).

mapf
  • 1,906
  • 1
  • 14
  • 40
  • 1
    Upvoting, as I suspect this will be faster in most cases (i.e., when `a` is not something like `[5,4,6,3,7,2,8,1]`, alternating between new mins and new maxes). – chepner Jun 04 '20 at 20:42
  • Thank you @chepner and good point! I'm not very sure about the performance, since I don't really have a feel yet for how expensive certain operations are, it was just something that came to mind trying to eliminate as many loops as possible. – mapf Jun 04 '20 at 20:47
  • Maybe somebody else can improve my approach. – mapf Jun 04 '20 at 20:49
  • 1
    I doubt this is faster for very large inputs in general. This is O(n^2) whereas the single-scan ``for``-loop is O(n). The ``while a.size:`` is O(n) and the nested ``a = a[(a > b[-1]) | (a < c[-1])]`` is also O(n) for a total O(n^2) . This approach should only have an advantage when each masking eliminates very many items, benefitting from the raw performance advantage of numpy arrays. Depending on the data, this might actually be a reasonable assumption, though. – MisterMiyagi Jun 04 '20 at 20:53
  • Thanks for your input @MisterMiyagi! I just ran the timing again, this time for 100,000 repetetions each, and the performance increase of the `while` loop remains to be roughly 4-5x faster than the `for` loop (27.8s and 112.9s respectively). So I guess at least for random numbers, the performance advantage of numpy arrays holds. – mapf Jun 04 '20 at 21:03
  • 1
    Yes, for random numbers this should work fairly well. Consider that if you draw ``n`` numbers from the interval [0,``n``), there is a 50% chance the first number is larger than 50% of the following items. If the number of items is larger than the range of items, that will also be an advantage (e.g. if there are only the numbers ``0`` and ``1`` your algorithm takes two steps, no matter the size). As a hunch, your algorithm probably is O(a * n) where n is the size of the list and a is the size of the alphabet (range of values). Since the O(n) part is done by numpy, that part is extremely fast. – MisterMiyagi Jun 04 '20 at 21:11
  • @MisterMiyagi those assumptions sound reasonable to me. So maybe if you really wanted the best performance, you could combine the `for` and `while` loop in a master function that decides (maybe even dynamically?) which of the two is run based on *a* and *n*. – mapf Jun 04 '20 at 21:16
  • Thank you guys. This is helpful! – G-09 Jun 04 '20 at 23:22
2

To start, you can simply initialize b and c with the first element of a. This simplifies the loop (of which you only need 1):

a = [...]
b = [a[0]]
c = [a[0]]
for x in a[1:]:
    if x > b[-1]:
        b.append(x)
    elif x < c[-1]:
        c.append(x)

Note that inside the loop, a value of x cannot be both larger than the current maximum and smaller than the current minimum, hence the elif rather than two separate if statements.

Another optimization would be two use additional variables to avoid indexing b and c repeatedly, as well as an explicit iterator to avoid making a shallow copy of a.

a = [...]
a_iter = iter(a)
curr_min = curr_max = next(a_iter)
b = [curr_max]
c = [curr_min]
for x in a_iter:
    if x > curr_max:
        b.append(x)
        curr_max = x
    elif x curr_min:
        c.append(x)
        curr_min = x
chepner
  • 497,756
  • 71
  • 530
  • 681