
I am pretty new to coding and am currently struggling with how to optimize this code for larger lists.

import pandas as pd
import random
from time import time

rows = []

list1 = [random.randint(1, 100) for i in range(1_000_000)]
list2 = [random.randint(1, 100) for i in range(1_000_000)]
list3 = [random.randint(1, 100) for i in range(1_000_000)]
list4 = [random.randint(1, 100) for i in range(1_000_000)]

start = time()

for i in range(len(list1) - 1):
    if list1[i] < list2[i] and list1[i + 1] > list2[i + 1]:
        dict1 = {1: list1[i], 2: '+'}
        rows.append(dict1)
    elif list1[i] > list2[i] and list1[i + 1] < list2[i + 1]:
        dict1 = {1: list1[i], 2: '-'}
        rows.append(dict1)

    if list3[i] < list4[i] and list3[i + 1] > list4[i + 1]:
        dict1 = {1: list3[i], 2: '+'}
        rows.append(dict1)
    elif list3[i] > list4[i] and list3[i + 1] < list4[i + 1]:
        dict1 = {1: list3[i], 2: '-'}
        rows.append(dict1)
    else:
        dict1 = {1: list3[i], 2: '#'}
        rows.append(dict1)
end = time()
print(end - start)
df = pd.DataFrame(rows)

With 10,000,000 entries it takes about 30 seconds, and the runtime grows linearly. Is there a way to optimize it for larger inputs?

I feel like the for loop and the if-else statements are the biggest time consumers, but I can't figure out a way to optimize them.

TiRePS
  • It won't get exponentially larger, it will get linearly larger. It's O(N). What are you actually trying to accomplish here? – Tim Roberts May 03 '21 at 01:58
  • BTW, your first statement has a typo: the last `list1` should be `list2`. The whole concept here is flawed. If `list1[i] < list2[i]`, then BY DEFINITION `list1[i]+1 < list2[i] + 1`. I don't think this is what you intended to do. – Tim Roberts May 03 '21 at 02:00
  • Are you looking for where these two series cross? I assume you mean `if list1[i] < list2[i] and list1[i+1] > list2[i+1]`. If so, the way to do this is to use `numpy`, subtract the two lists, and look for zero crossings with `np.where(np.diff(np.sign(x)))`. – Tim Roberts May 03 '21 at 02:06
  • I'm sorry about the mistakes, I fixed the typos now. – TiRePS May 03 '21 at 02:17
  • Ok, thank you, I will try it. – TiRePS May 03 '21 at 02:17
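
The zero-crossing approach from the comments can be sketched on a toy pair of series (the values below are illustrative, not from the question):

```python
import numpy as np

# Two short series: a crosses above b between indices 1-2
# and back below between indices 3-4.
a = np.array([1, 2, 5, 6, 3])
b = np.array([4, 4, 4, 4, 4])

x = a - b                              # [-3, -2, 1, 2, -1]
crossings = np.where(np.diff(np.sign(x)))[0]
print(crossings)                       # [1 3]: index i where the sign flips between i and i+1
```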

1 Answer


Thanks to @Tim Roberts and this question, I figured out a way to optimize my code.

import random
from time import time

import numpy as np

list1 = np.array([random.randint(1, 100) for i in range(10_000_000)])
list2 = np.array([random.randint(1, 100) for i in range(10_000_000)])
list3 = np.array([random.randint(1, 100) for i in range(10_000_000)])
list4 = np.array([random.randint(1, 100) for i in range(10_000_000)])

start = time()
list_sub1 = np.subtract(list2, list1)
list_sub2 = np.subtract(list4, list3)
positive = list_sub1 > 0
positive2 = list_sub2 > 0
results = np.bitwise_xor(positive[1:], positive[:-1]).nonzero()[0]
results2 = np.bitwise_xor(positive2[1:], positive2[:-1]).nonzero()[0]
end = time()
print(end-start)

This way it only takes about 0.25 seconds to analyze 10,000,000 entries instead of 30 seconds.

I changed the np.where(np.diff(np.sign(x))) suggested by @Tim Roberts to np.bitwise_xor(positive[1:], positive[:-1]).nonzero()[0] because this gave me roughly a 2x speedup.

NOTE: This program treats 0 as negative and only registers a change at > 0 (1/-1). With np.sign(), 0 gets its own state (1/0/-1). In my case I don't need any further distinction, so I chose the quicker method.
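
The difference only shows up when the difference series touches zero. A minimal sketch (toy values) of both methods on such a series:

```python
import numpy as np

d = np.array([-1, 0, 1])   # difference series that passes through zero

# sign-based: 0 is its own state, so both steps register a change
sign_hits = np.where(np.diff(np.sign(d)))[0]

# >0-based: 0 counts as "not positive", so only one change registers
pos = d > 0
xor_hits = np.bitwise_xor(pos[1:], pos[:-1]).nonzero()[0]

print(sign_hits)   # [0 1]
print(xor_hits)    # [1]
```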

The NumPy documentation is great if you're new to vectorization and NumPy in general.
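
If you still need the '+'/'-' rows from the original loop, they can be rebuilt from the crossing indices without a Python loop. A sketch with tiny illustrative arrays (the integer column keys 1 and 2 mirror the dicts in the question):

```python
import numpy as np
import pandas as pd

list1 = np.array([1, 5, 2, 6])
list2 = np.array([3, 3, 3, 3])

diff = np.subtract(list2, list1)       # > 0 wherever list1 < list2
positive = diff > 0
idx = np.bitwise_xor(positive[1:], positive[:-1]).nonzero()[0]

# '+' where list1 was below list2 at the crossing index, '-' otherwise,
# matching the if/elif branches of the original loop
signs = np.where(positive[idx], '+', '-')
df = pd.DataFrame({1: list1[idx], 2: signs})
print(df)
```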

TiRePS