3

Suppose I have a list that goes like : ''' [1,2,3,4,9,10,11,20] ''' I need the result to be like : ''' [[4,9],[11,20]] ''' I have defined a function that goes like this :

def get_range(lst):
i=0
seqrange=[]
for new in lst:
    a=[]
    start=new
    end=new
    if i==0:
        i=1
        old=new
    else:
        if new - old >1:
            a.append(old)
            a.append(new)
    old=new
    if len(a):
        seqrange.append(a)
return seqrange

Is there any other easier and efficient way to do it? I need to do this in the range of millions.

SidMenon
  • 35
  • 3

4 Answers4

2

You can use numpy arrays and the diff function that comes along with them. Numpy is so much more efficient than looping when you have millions of rows.


Slight aside: Why are numpy arrays so fast? Because they are arrays of data instead of arrays of pointers to data (which is what Python lists are), because they offload a whole bunch of computations to a backend written in C, and because they leverage the SIMD paradigm to run a Single Instruction on Multiple Data simultaneously.


Now back to the problem at hand:

The diff function gives us the difference between consecutive elements of the array. Pretty convenient, given that we need to find where this difference is greater than a known threshold!

import numpy as np

threshold = 1
arr = np.array([1,2,3,4,9,10,11,20])

deltas = np.diff(arr)
# There's a gap wherever the delta is greater than our threshold
gaps = deltas > threshold 
gap_indices = np.argwhere(gaps)

gap_starts = arr[gap_indices]
gap_ends = arr[gap_indices + 1] 

# Finally, stack the two arrays horizontally
all_gaps = np.hstack((gap_starts, gap_ends))
print(all_gaps)
# Output: 
# [[ 4  9]
#  [11 20]]

You can access all_gaps like a 2D matrix: all_gaps[0, 1] would give you 9, for example. If you really need the answer as a list-of-lists, simply convert it like so:

all_gaps_list = all_gaps.tolist()
print(all_gaps_list)
# Output: [[4, 9], [11, 20]]

Comparing the runtime of the iterative method from @happydave's answer with the numpy method:

import random
import timeit

import numpy

def gaps1(arr, threshold):
    deltas = np.diff(arr)
    gaps = deltas > threshold 
    gap_indices = np.argwhere(gaps)
    gap_starts = arr[gap_indices]
    gap_ends = arr[gap_indices + 1] 
    all_gaps = np.hstack((gap_starts, gap_ends))
    return all_gaps

def gaps2(lst, thr):
    seqrange = []
    for i in range(len(lst)-1):
      if lst[i+1] - lst[i] > thr:
        seqrange.append([lst[i], lst[i+1]])
    return seqrange

test_list = [i for i in range(100000)]
for i in range(100):
    test_list.remove(random.randint(0, len(test_list) - 1))

test_arr = np.array(test_list)

# Make sure both give the same answer:
assert np.all(gaps1(test_arr, 1) == gaps2(test_list, 1))

t1 = timeit.timeit('gaps1(test_arr, 1)', setup='from __main__ import gaps1, test_arr', number=100)
t2 = timeit.timeit('gaps2(test_list, 1)', setup='from __main__ import gaps2, test_list', number=100)

print(f"t1 = {t1}s; t2 = {t2}s; Numpy gives ~{t2 // t1}x speedup")

On my laptop, this gives:

t1 = 0.020834800001466647s; t2 = 1.2446780000027502s; Numpy gives ~59.0x speedup

My word that's fast!

Pranav Hosangadi
  • 23,755
  • 7
  • 44
  • 70
1

There is iterator based solution. It'is allow to get intervals one by one:

flist =  [1,2,3,4,9,10,11,20]


def get_range(lst):
    start_idx = lst[0]
    for current_idx in flist[1:]:
        if current_idx > start_idx+1:
            yield [start_idx, current_idx]
        start_idx = current_idx


for inverval in get_range(flist):
    print(inverval)
Stanislav Ivanov
  • 1,854
  • 1
  • 16
  • 22
0

I don't think there's anything inefficient about the solution, but you can clean up the code quite a bit:

seqrange = []
for i in range(len(lst)-1):
  if lst[i+1] - lst[i] > 1:
    seqrange.append([lst[i], lst[i+1]])
happydave
  • 7,127
  • 1
  • 26
  • 25
  • Thank you so much. Don't know the feasibility when using millions of numbers though, will update upon testing. – SidMenon Oct 07 '20 at 16:42
0

I think this could be more efficient and a bit cleaner.

def func(lst):
    ans=0
    final=[]
    sol=[]
    for i in range(1,lst[-1]+1):
        if(i not in lst):
            ans+=1
            final.append(i)
        elif(i in lst and ans>0):
            final=[final[0]-1,i]
            sol.append(final)
            ans=0
            final=[]
        else:
            final=[]
    return(sol)
        
    
Ansh Arora
  • 25
  • 1
  • 7