I've tried to apply the following code, which I found in an older thread (Removing *NEARLY* Duplicate Observations - Python), to my data sample:
from sklearn.preprocessing import OrdinalEncoder
import pandas as pd
import numpy as np


def dedupe_partially_vectorized(df, threshold=1):
    """
    - Iterate through each row starting from the last; examine all previous rows for duplicates.
    - If found, it is appended to a list of duplicate indices.
    """
    # convert field data to integers
    enc = OrdinalEncoder()
    X = enc.fit_transform(df.to_numpy())

    """
    - loop through each row, starting from last
    - for each `row`, calculate hamming distance to all previous rows
    - if any such distance is `threshold` or less, mark `idx` as duplicate
    - loop ends at 2nd row (1st is by definition not a duplicate)
    """
    dupe_idx = []
    for j in range(len(X) - 1):
        idx = len(X) - j - 1
        row = X[idx]
        prev_rows = X[0:idx]
        dists = np.sum(row != prev_rows, axis=1)
        if min(dists) <= threshold:
            dupe_idx.append(idx)
    dupe_idx = sorted(dupe_idx)
    df_dupes = df.iloc[dupe_idx]
    df_deduped = df.drop(dupe_idx)
    return (df_deduped, df_dupes)
The problem with my sample is that it contains all possible combinations, and I just want to keep one entry within a certain threshold. As I understand it, I therefore need to drop the rows during each loop iteration (unlike the code above, which marks the rows and drops them all at the end). I've tried to rewrite the code to loop through dists, append all indices that meet the condition to a list, then drop those rows from df and empty the list.
My code is the following:
enc = OrdinalEncoder()
X = enc.fit_transform(df.to_numpy())
dupe_idx = []
threshold = 1
for j in range(len(X) - 1):
    idx = len(X) - j - 1
    row = X[idx]
    prev_rows = X[0:idx]
    dists = np.sum(row != prev_rows, axis=1)
    for i in range(len(dists)):
        if dists[i] <= threshold:
            dupe_idx.append(i)
    df = df.drop(df.index[dupe_idx])
    dupe_idx = []
print(df)
It works great for the first loop iteration but raises an error on the next one: IndexError: index 666 is out of bounds for axis 0 with size 661
I think I need some help with the lines below dists = np.sum(...); I'm certainly missing something.
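For context, here is a minimal sketch (with made-up data, not my real sample) of how the sizes drift apart: X keeps its original length throughout, while df shrinks after every drop, so a positional index taken from X can point past the end of the shrunken df.index.

```python
import numpy as np
import pandas as pd

# toy frame standing in for my real data
df = pd.DataFrame({"a": range(5)})
X = df.to_numpy()  # length 5, and it never changes

# after one round of dropping, df is shorter than X
df = df.drop(df.index[[0, 1]])  # df now has 3 rows

# the next pass still iterates over all len(X) = 5 rows,
# so a position like 4 no longer exists in df.index
try:
    df.index[4]
except IndexError:
    print("positional index from X is out of bounds for the shrunken df")
```

This seems to match the error I get, where the reported size (661) is the current length of df after the previous drop.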