0

Is a nested for-loop necessary for this code, or is there a more efficient work-around?

This is a simplified version which searches for successive, overlapping intervals within a data set comprised of 20 random integers from 1 to 1000. It runs through error values of 1-100 to create the intervals by adding/subtracting them from the 20 random integers.


Example:

input assuming data frame of size 10 instead of 20:

df = [433, 3, 4, 5, 6, 7, 378, 87, 0, 500]

output for error = 1 in for-loop:

overlaps = {0:[[1, 2, 3, 4, 5]]}

def find_overlap(df, error):
    """
    df: dataframe with random 20 integers from 1-1000
    error: used to create the interval by +/- to each value in the dataframe
    returns: list of list of indexes overlapping
    """

    # add the interval to the dataframe as columns of minimum and maximum
    df["min"] = df["x"] - error
    df["max"] = df["x"] + error

    # overlaps stores lists of indexes that overlap
    overlaps = []

    # fill in data for start
    temporary = [0]
    minimum = df["min"].iloc[0]
    maximum = df["min"].iloc[0]

    # iterates through the dataframe checking for overlap between successive intervals
    for index , row in df.iterrows():
        current_min = row["min"]
        current_max = row["max"]

        # yes overlap
        if (current_min <= maximum) and (current_max >= minimum):
            temporary.append(index)
            if current_min > minimum:
                minimum = current_min
            if current_max < maximum:
                maximum = current_max
            continue

        # no overlap - also check for 5 successive overlaps
        if len(temporary) >= 5:
            overlaps.append(temporary)
        temporary = [index]
        minimum = current_min
        maximum = current_max

    return overlaps



# creates dataframe with 20 random integers from 1 to 1000
df = pd.DataFrame(np.random.randint(1, 1000, 20), columns=["x"])

overlaps = {}
for error in range(0,100):
    lst = find_overlap(df, error)
    if len(lst):
        overlaps[error] = lst

print(overlaps)
Brandon
  • 89
  • 7
  • 6
    Perhaps instead of having us parse through this rather long program, you could provide some expected input and some expected output with the rules for how and why the output is expected? [How to make good reproducible pandas examples](https://stackoverflow.com/questions/20109391/how-to-make-good-reproducible-pandas-examples) – Henry Ecker Jul 22 '21 at 01:50
  • Sorry, is this better? – Brandon Jul 22 '21 at 02:18

1 Answers1

1

So, from what I've understood out of your code... You're looking to:

  1. Compute the difference between all values of x.
  2. Determine if it's less than error where error ranges from [0, 100).
  3. Select all subarrays of size 5.

Assuming my interpretation is correct... You can actually vectorize this and avoid for-loops, like your intuition led you to believe. Ultimately, if my interpretation is incorrect, this should, at least, give you a decent start to creating a vectorized version of your desired code.

Updated Solution (considers 5-tuples)

import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randint(1, 1000, 20), columns=["x"])

overlaps = {}

for margin in range(0, 100):
    diffs = np.abs(df["x"].values - np.roll(df["x"], margin))
    # np.convolve is analogous to a sliding window sum
    quint = np.convolve(diffs == margin, np.ones(5), "valid")
    index = np.nonzero(quint == 5)[0]
    if index.size > 0:
        overlaps[margin] = [list(range(i, i + 5)) for i in index]

Original Solution (doesn't consider 5-tuples)

import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randint(1, 1000, 20), columns=["x"])

overlaps = {}

for margin in range(0, 100):
    diffs = np.abs(df["x"].values - np.roll(df["x"], margin))
    index = np.nonzero(diff == margin)[0]
    if idx.size > 0:
        overlaps[margin] = idx

In case you're unfamiliar with numpy, .size gives you the total size of the ndarray. (So a 3D array with shapes (10, 20, 30) has a size of 6000.)

John
  • 320
  • 2
  • 9