I have a dataset with millions of rows from which I would like to eliminate rows that are close to each other, i.e. rows whose difference is below a threshold x.
Here is an example of the dataset:
1 1 1 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1 1 1 3
1 1 1 1 1 1 1 1 1 1 1 1 2
1 1 1 1 1 1 1 1 1 1 1 3 1
1 1 1 1 1 1 1 1 1 1 1 3 3
1 1 1 1 1 1 1 1 1 1 1 3 2
As you can see, rows 1-3 and rows 4-6 are identical except for the last column. So, given a threshold of 1, I would like the outcome to be two rows (one row from 1-3 and one row from 4-6). For a threshold of 2, all rows are identical in every column except the last two, so only one row should remain. To be honest, it doesn't really matter which row is kept (first, last, random).
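To make "difference" concrete: by the difference between two rows I mean the number of columns in which they disagree (Hamming distance). A quick numpy sketch of that, with the six example rows typed in by hand:

import numpy as np

rows = np.array([
    [1]*12 + [1],     # row 1
    [1]*12 + [3],     # row 2
    [1]*12 + [2],     # row 3
    [1]*11 + [3, 1],  # row 4
    [1]*11 + [3, 3],  # row 5
    [1]*11 + [3, 2],  # row 6
])

# pairwise counts of differing columns; e.g. rows 1 and 2 differ
# only in the last column, so their distance is 1
pairwise = (rows[:, None, :] != rows[None, :, :]).sum(axis=2)
print(pairwise)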
I've tried the following code from another thread (Removing *NEARLY* Duplicate Observations - Python), but it only returns one row:
from sklearn.preprocessing import OrdinalEncoder
import pandas as pd
import numpy as np
# load the dataset (read_csv already returns a DataFrame; the second line just selects the listed columns)
data = pd.read_csv(r'testfile.csv')
df = pd.DataFrame(data, columns=['File_1', 'File_2', 'File_3', 'File_4', 'File_5', 'File_6', 'File_7',
                                 'File_8', 'File_9', 'File_10', 'File_11', 'File_12', 'File_13'])
def dedupe_partially_vectorized(df, threshold=1):
    """
    - Iterate through each row starting from the last; examine all previous rows for duplicates.
    - If found, it is appended to a list of duplicate indices.
    """
    # convert field data to integers
    enc = OrdinalEncoder()
    X = enc.fit_transform(df.to_numpy())

    """
    - loop through each row, starting from last
    - for each `row`, calculate hamming distance to all previous rows
    - if any such distance is `threshold` or less, mark `idx` as duplicate
    - loop ends at 2nd row (1st is by definition not a duplicate)
    """
    dupe_idx = []
    for j in range(len(X) - 1):
        idx = len(X) - j - 1
        row = X[idx]
        prev_rows = X[0:idx]
        dists = np.sum(row != prev_rows, axis=1)
        if min(dists) <= threshold:
            dupe_idx.append(idx)
    dupe_idx = sorted(dupe_idx)
    df_dupes = df.iloc[dupe_idx]
    df_deduped = df.drop(dupe_idx)
    return (df_deduped, df_dupes)
# run the deduplication and print the surviving rows
(df_deduped, df_dupes) = dedupe_partially_vectorized(df)
print(df_deduped)
When I run the code on the dataset above, it returns:
File_1 File_2 File_3 File_4 ... File_10 File_11 File_12 File_13
0 1 1 1 1 ... 1 1 1 1
[1 rows x 13 columns]
Process finished with exit code 0
It would be really helpful if someone could help me solve this!
Edit:
After trying the new code, it still doesn't remove all nearly duplicate observations. Here is an example of a dataset where only one row out of seven should be kept (right now it keeps 5 of 7):
1 1 3 3 3 2 1 2 1 1 2 2 1
1 3 3 3 3 2 1 2 1 1 2 2 1
1 1 3 3 3 2 1 2 1 1 2 2 2
1 1 3 3 3 2 3 2 1 1 2 2 1
1 1 3 3 3 2 1 2 1 1 2 1 1
1 1 3 3 3 2 1 2 1 1 2 3 1
1 1 3 3 3 2 1 2 2 1 2 2 1
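For what it's worth, here is a rough greedy sketch of my own (the helper greedy_dedupe below is just my name for it, and it assumes the row-to-row difference is the count of differing columns, as above): a row is kept only if it differs from every already-kept row in more than threshold columns. On the seven rows above it keeps exactly one row, and on the first example it keeps two rows for threshold 1 and one row for threshold 2, which is what I was expecting. But the result depends on row order and I'm not sure it is the right rule in general:

import numpy as np
import pandas as pd

def greedy_dedupe(df, threshold=1):
    X = df.to_numpy()
    kept = []  # positional indices of the rows kept so far
    for i, row in enumerate(X):
        # keep this row only if it differs from every kept row in more than `threshold` columns
        if all(np.sum(row != X[k]) > threshold for k in kept):
            kept.append(i)
    return df.iloc[kept]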