I've tried to apply the following code, which I found in an older thread (Removing *NEARLY* Duplicate Observations - Python), to my data sample:
from sklearn.preprocessing import OrdinalEncoder
import pandas as pd
import numpy as np


def dedupe_partially_vectorized(df, threshold=1):
    """
    - Iterate through each row starting from the last; examine all previous rows for duplicates.
    - If found, it is appended to a list of duplicate indices.
    """
    # convert field data to integers
    enc = OrdinalEncoder()
    X = enc.fit_transform(df.to_numpy())

    """
    - loop through each row, starting from last
    - for each `row`, calculate hamming distance to all previous rows
    - if any such distance is `threshold` or less, mark `idx` as duplicate
    - loop ends at 2nd row (1st is by definition not a duplicate)
    """
    dupe_idx = []
    for j in range(len(X) - 1):
        idx = len(X) - j - 1
        row = X[idx]
        prev_rows = X[0:idx]
        dists = np.sum(row != prev_rows, axis=1)
        if min(dists) <= threshold:
            dupe_idx.append(idx)
    dupe_idx = sorted(dupe_idx)
    df_dupes = df.iloc[dupe_idx]
    df_deduped = df.drop(dupe_idx)
    return (df_deduped, df_dupes)
The problem with my sample is that it contains all possible combinations, and I just want to keep one entry within a certain threshold. As I understand it, I therefore need to drop the rows during each loop iteration (unlike the code above, which marks the rows and drops them all at the end). I've tried to rewrite the code to loop through dists, append all indices that meet the condition to a list, then drop those rows from df and empty the list.
My code is the following:
enc = OrdinalEncoder()
X = enc.fit_transform(df.to_numpy())
dupe_idx = []
threshold = 1
for j in range(len(X) - 1):
    idx = len(X) - j - 1
    row = X[idx]
    prev_rows = X[0:idx]
    dists = np.sum(row != prev_rows, axis=1)
    for i in range(len(dists)):
        if dists[i] <= threshold:
            dupe_idx.append(i)
    df = df.drop(df.index[dupe_idx])
    dupe_idx = []
print(df)
It works great for the first loop iteration but raises an error on the next one: IndexError: index 666 is out of bounds for axis 0 with size 661
I think I need some help with the lines below dists = np.sum(...); I'm certainly missing something.
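For context, here is a minimal sketch (with made-up data, not my real sample) of how the sizes drift apart: X keeps its original length throughout, while df shrinks after every drop, so a positional index taken from X can point past the end of the shrunken df.index.

```python
import numpy as np
import pandas as pd

# toy frame standing in for my real data
df = pd.DataFrame({"a": range(5)})
X = df.to_numpy()  # length 5, and it never changes

# after one round of dropping, df is shorter than X
df = df.drop(df.index[[0, 1]])  # df now has 3 rows

# the next pass still iterates over all len(X) = 5 rows,
# so a position like 4 no longer exists in df.index
try:
    df.index[4]
except IndexError:
    print("positional index from X is out of bounds for the shrunken df")
```

This seems to match the error I get, where the reported size (661) is the current length of df after the previous drop.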