I have a pandas DataFrame in which I store binary data in a column of ~360,000 entries. I am looking for a more efficient way to find the transitions 0 -> 1 and 1 -> 0.

Currently I iterate through it and check the specific conditions at each index. This is quite readable, but since the functionality is used several times, it is the bottleneck of a larger script. The last index is left unchecked, but this is not crucial.

startedge, endedge = [], []
for i in range(0, len(df.Binarywindow) - 1):
    if df.Binarywindow[i] == 0 and df.Binarywindow[i+1] == 1:
        startedge.append(i)
    elif df.Binarywindow[i] == 1 and df.Binarywindow[i+1] == 0:
        endedge.append(i)

Can you help me rewrite it?

Dr.Ario

1 Answer

The method you describe will indeed be slow for large data sets. The loop runs ~360,000 iterations in Python, and each one performs several scalar lookups through pandas indexing and possibly an append() of a single entry. You can speed this up significantly by converting to NumPy arrays and using a single operation to search for the edges. I wrote a minimal example to demonstrate with a random set of binary data.

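import time

import numpy as np
import pandas as pd

# Random binary test data in a single-column DataFrame (column label 0)
binaries = np.random.randint(0, 2, 200000)
Binary = pd.DataFrame(binaries)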

t1 = time.time()

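# Loop-based method. DataFrame.append returns a new frame (discarded here)
# and exists only in pandas < 2.0.
startedge, endedge = pd.DataFrame([]), pd.DataFrame([])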
for i in range(0, len(Binary) - 1):
    if Binary[0][i] == 0 and Binary[0][i+1] == 1:
        startedge.append([i])
    elif Binary[0][i] == 1 and Binary[0][i+1] == 0:
        endedge.append([i])

t2 = time.time()
print(f"Looping through took {t2-t1} seconds")

# Numpy based method, including conversion of the dataframe
t1 = time.time()
binary_array = np.array(Binary[0])

startedges = search_sequence_numpy(binary_array, np.array([0,1]))
stopedges = search_sequence_numpy(binary_array, np.array([1,0]))

t2 = time.time()
print(f"Converting to a numpy array and looping through required {t2-t1} seconds")

Output:

Looping through took 56.22933220863342 seconds
Converting to a numpy array and looping through required 0.029932022094726562  seconds
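The two conditions in the loop compare each element with its successor, so they can also be written as one vectorised difference. The following is a minimal sketch, not the method timed above: `find_edges` is a hypothetical helper name, and it assumes the column contains only 0 and 1 (or booleans).

```python
import numpy as np

def find_edges(values):
    """Return (startedge, endedge) index arrays for 0->1 and 1->0 transitions."""
    # Cast to a signed type so the difference can be -1
    # (bool or unsigned input would not produce negative values).
    arr = np.asarray(values).astype(np.int8)
    step = np.diff(arr)                     # step[i] = arr[i+1] - arr[i]
    startedge = np.flatnonzero(step == 1)   # arr[i] == 0 and arr[i+1] == 1
    endedge = np.flatnonzero(step == -1)    # arr[i] == 1 and arr[i+1] == 0
    return startedge, endedge

sample = [0, 0, 1, 1, 0, 1, 0, 0]
startedge, endedge = find_edges(sample)
print(startedge)  # [1 4]
print(endedge)    # [3 5]
```

With the DataFrame from the question this would be called as `find_edges(df["Binarywindow"])`. The indices follow the same convention as the loop: `i` is the position just before the change.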

For the sequence search function I used the code from this answer: "Searching a sequence in a NumPy array".

def search_sequence_numpy(arr, seq):
    """ Find sequence in an array using NumPy only.

    Parameters
    ----------
    arr    : input 1D array
    seq    : input 1D array

    Output
    ------
    Output : 1D Array of indices in the input array that satisfy the
    matching of input sequence in the input array.
    In case of no match, an empty list is returned.
    """

    # Store sizes of input array and sequence
    Na, Nseq = arr.size, seq.size

    # Range of sequence
    r_seq = np.arange(Nseq)

    # Create a 2D array of sliding indices across the entire length of input array.
    # Match up with the input sequence & get the matching starting indices.
    M = (arr[np.arange(Na - Nseq + 1)[:, None] + r_seq] == seq).all(1)

    # Get the range of those indices as final output
    if M.any():
        return np.where(np.convolve(M, np.ones((Nseq), dtype=int)) > 0)[0]
    else:
        return []         # No match found
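
Note that the `np.convolve` step makes the function above return the index of every element covered by a match, so for a two-element pattern each edge contributes both `i` and `i+1`. For the array `[0, 0, 1, 1, 0, 1, 0, 0]` and the pattern `[0, 1]` it yields `[1 2 4 5]`, whereas the loop in the question records only `1` and `4`. If only the start of each match is wanted, a variant along these lines may be closer to the loop. This is a sketch: `match_starts` is a hypothetical name, and it assumes NumPy 1.20 or later for `sliding_window_view`.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def match_starts(arr, seq):
    """Return only the index at which each occurrence of seq begins."""
    arr, seq = np.asarray(arr), np.asarray(seq)
    if arr.size < seq.size:
        return np.array([], dtype=int)      # pattern longer than input
    # One row per window of len(seq) consecutive elements (a view, no copy)
    windows = sliding_window_view(arr, seq.size)
    hits = (windows == seq).all(axis=1)     # True where the whole window matches
    return np.flatnonzero(hits)

sample = np.array([0, 0, 1, 1, 0, 1, 0, 0])
print(match_starts(sample, [0, 1]))  # [1 4]
print(match_starts(sample, [1, 0]))  # [3 5]
```

Unlike the function above, this always returns a NumPy array, including the empty case.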
JaHoff