
I have a dataset with millions of rows and would like to eliminate rows that are close to each other, i.e. rows where the difference is below a threshold x.

Here is an example of the dataset:

1 1 1 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1 1 1 3
1 1 1 1 1 1 1 1 1 1 1 1 2
1 1 1 1 1 1 1 1 1 1 1 3 1
1 1 1 1 1 1 1 1 1 1 1 3 3
1 1 1 1 1 1 1 1 1 1 1 3 2

As you can see, rows 1-3 and rows 4-6 are identical except for the last column. So, given a threshold of 1, I would like the outcome to be two rows (one row from 1-3 and one row from 4-6). With a threshold of 2, all rows are identical in all columns except the last two, so only one row should remain. To be honest it doesn't really matter which row is kept (first, last, random).
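
To make the threshold idea concrete, here is a rough sketch (just an illustration with NumPy, using the six example rows above) of the pairwise count of differing columns I have in mind:

import numpy as np

# the six example rows from above
X = np.array([
    [1]*12 + [1],
    [1]*12 + [3],
    [1]*12 + [2],
    [1]*11 + [3, 1],
    [1]*11 + [3, 3],
    [1]*11 + [3, 2],
])

# pairwise hamming-style distance: number of columns in which two rows differ
dists = (X[:, None, :] != X[None, :, :]).sum(axis=2)
print(dists)
# rows 1-3 differ from each other in at most 1 column, as do rows 4-6,
# so with threshold 1 I expect one survivor per group (two rows in total)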

I've tried the following code from another thread (Removing *NEARLY* Duplicate Observations - Python), but it only returns one row:

from sklearn.preprocessing import OrdinalEncoder
import pandas as pd
import numpy as np

#load dataframe
data = pd.read_csv(r'testfile.csv')
df = pd.DataFrame(data, columns=['File_1', 'File_2', 'File_3', 'File_4', 'File_5', 'File_6', 'File_7', 'File_8', 'File_9', 'File_10', 'File_11', 'File_12', 'File_13'])

def dedupe_partially_vectorized(df, threshold=1):
    """
    - Iterate through each row starting from the last; examine all previous rows for duplicates.
    - If found, it is appended to a list of duplicate indices.
    """
    # convert field data to integers
    enc = OrdinalEncoder()
    X = enc.fit_transform(df.to_numpy())

    """
    - loop through each row, starting from last
    - for each `row`, calculate hamming distance to all previous rows
    - if any such distance is `threshold` or less, mark `idx` as duplicate
    - loop ends at 2nd row (1st is by definition not a duplicate)
    """
    dupe_idx = []
    for j in range(len(X) - 1):
        idx = len(X) - j - 1
        row = X[idx]
        prev_rows = X[0:idx]
        dists = np.sum(row != prev_rows, axis=1)
        if min(dists) <= threshold:
            dupe_idx.append(idx)
    dupe_idx = sorted(dupe_idx)
    df_dupes = df.iloc[dupe_idx]
    df_deduped = df.drop(dupe_idx)
    return (df_deduped, df_dupes)

# run the dedupe and print the deduplicated result
(df_deduped, df_dupes) = dedupe_partially_vectorized(df)

print(df_deduped)

When running the code on the above dataset it returns:

   File_1  File_2  File_3  File_4  ...  File_10  File_11  File_12  File_13
0       1       1       1       1  ...        1        1        1        1

[1 rows x 13 columns]

Process finished with exit code 0

It'd be really helpful if someone could help me solve this!

Edit:

After trying the suggested code, it still doesn't remove all of the nearly duplicate observations. Here is an example of a dataset that should keep only one row out of seven (right now it keeps 5 of 7); a quick distance check on these rows is sketched after the data:

1 1 3 3 3 2 1 2 1 1 2 2 1 
1 3 3 3 3 2 1 2 1 1 2 2 1  
1 1 3 3 3 2 1 2 1 1 2 2 2  
1 1 3 3 3 2 3 2 1 1 2 2 1  
1 1 3 3 3 2 1 2 1 1 2 1 1  
1 1 3 3 3 2 1 2 1 1 2 3 1  
1 1 3 3 3 2 1 2 2 1 2 2 1
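
For reference, a quick check (the seven rows above loaded into a NumPy array) shows that every row differs from the first one in exactly one column, which is why I expect a single row to survive at threshold 1:

import numpy as np

rows = np.array([
    [1, 1, 3, 3, 3, 2, 1, 2, 1, 1, 2, 2, 1],
    [1, 3, 3, 3, 3, 2, 1, 2, 1, 1, 2, 2, 1],
    [1, 1, 3, 3, 3, 2, 1, 2, 1, 1, 2, 2, 2],
    [1, 1, 3, 3, 3, 2, 3, 2, 1, 1, 2, 2, 1],
    [1, 1, 3, 3, 3, 2, 1, 2, 1, 1, 2, 1, 1],
    [1, 1, 3, 3, 3, 2, 1, 2, 1, 1, 2, 3, 1],
    [1, 1, 3, 3, 3, 2, 1, 2, 2, 1, 2, 2, 1],
])

# number of columns in which each row differs from the first row
print((rows != rows[0]).sum(axis=1))  # [0 1 1 1 1 1 1]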
1 Answer


One way, using itertools.combinations with drop_duplicates:

from itertools import combinations

n = 1
# all column subsets that leave out n columns
colsets = [list(c) for c in combinations(df.columns, len(df.columns) - n)]
# drop duplicates on each subset and keep the result with the fewest rows
min((df.drop_duplicates(subset=c) for c in colsets), key=len)

Output:

   0   1   2   3   4   5   6   7   8   9   10  11  12
0   1   1   1   1   1   1   1   1   1   1   1   1   1
3   1   1   1   1   1   1   1   1   1   1   1   3   1
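
If you want to reuse this for different values of `n`, you could wrap it in a small helper (the function name `dedupe_ignoring_n` is just illustrative, assuming `df` is the DataFrame from the question):

from itertools import combinations
import pandas as pd

def dedupe_ignoring_n(df: pd.DataFrame, n: int = 1) -> pd.DataFrame:
    # drop duplicates on every subset of columns that leaves out n columns,
    # then keep whichever result has the fewest rows
    colsets = combinations(df.columns, len(df.columns) - n)
    return min((df.drop_duplicates(subset=list(c)) for c in colsets), key=len)

print(dedupe_ignoring_n(df, n=1))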
  • I've now tried the code on a larger dataset and it doesn't seem to remove all nearly duplicate values. It worked perfectly on previous dataset but when looking at this one it should only keep 1 row but it keeps 5 out of 7 rows. `1 1 3 3 3 2 1 2 1 1 2 2 1 1 3 3 3 3 2 1 2 1 1 2 2 1 1 1 3 3 3 2 1 2 1 1 2 2 2 1 1 3 3 3 2 3 2 1 1 2 2 1 1 1 3 3 3 2 1 2 1 1 2 1 1 1 1 3 3 3 2 1 2 1 1 2 3 1 1 1 3 3 3 2 1 2 2 1 2 2 1 ` – OldSport Aug 18 '21 at 11:55
  • Can you post the new sample directly to the question? – Chris Aug 18 '21 at 13:45
  • Edited original question. – OldSport Aug 18 '21 at 13:51
  • @OldSport I checked and it works fine. Your new sample has 5 heterogeneous columns; thus you would need to set `n` to 5. – Chris Aug 19 '21 at 02:01
  • Hm okay. I've tried again. In the original dataset we have millions of combinations, giving the possibility of rows where there are 12 heterogeneous columns. I'd like to be able to remove all rows that have an almost identical match in the dataset but differ in x columns, so that the output gives me the minimum number of rows with at least a hamming distance of x (or similar) between all other remaining rows. Can I achieve that with this code? – OldSport Aug 19 '21 at 07:43
  • The code is basically trying to remove duplicates without considering n columns. Suppose your `n` is 2; it will consider rows with at most 2 different values equivalent. – Chris Aug 19 '21 at 07:52
  • Sounds like what I'm looking for; strange that it doesn't give me the result I'm looking for. – OldSport Aug 19 '21 at 08:54