How to find near-duplicates using LSH in a dataframe?

Question

I have a few columns such as file name, file size, and date, and I want to find near-duplicate rows by considering all of these columns together.

Id        Name                          Size        Date
1   lib_mysqludf_sys.html               8934    2020-11-10 06:25:57
2   lib_mysqludf_sys.c                  8715    2020-11-10 12:12:41
3   lib_mysqludf_sys.so                 8480    2020-11-10 08:51:33
4   install.sh                          1544    2020-11-10 12:17:16
5   lib_mysqludf_sys.sql                7900    2020-11-10 06:25:59
6   Makefile                            124     2020-11-10 06:36:43
7   lib_mysqludf_sys-master             4096    2020-11-10 12:12:41
8   cmake-3.17.0.tar.gz                 9466484 2020-11-09 08:23:31
9   fileclassification.cpython-36.pyc   522     2020-11-03 12:00:43
10  fileclassification.cpython-38.pyc   518     2020-11-04 05:49:24
11  __pycache__                         4096    2020-11-04 05:49:24
12  fileclassification.py               272     2020-11-03 12:00:41
13  asset_classifier                    4096    2020-11-03 12:00:42
14  pyvenv.cfg                          69      2020-11-04 04:56:36

In the dataframe above, four files have similar names, sizes, and dates.

Expected output

Id    Name                     Near Duplicates  
1     lib_mysqludf_sys.html    ['lib_mysqludf_sys.c','lib_mysqludf_sys.so',
                                'lib_mysqludf_sys.html','lib_mysqludf_sys.sql']

Unless what counts as "nearby" files is as clear-cut as in your example, it'll be shaky. Do you have a more precise logic or definition of "nearbyness"? E.g., are lines 9 and 10 nearby? They could very well be totally different files. – Serge de Gosson de Varennes, Dec 24 '20 at 10:09
@SergedeGossondeVarennes Yes, they are near-duplicates because their file name, size, and date nearly match. – sheel, Dec 24 '20 at 10:14
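
One possible way to make "file name, size, and date nearly match" concrete is MinHash with LSH banding. The sketch below is only an illustration under several assumptions that are not in the question: each row becomes a set of tokens (character 3-grams of the name, a power-of-two size bucket, and the calendar day), similarity is Jaccard similarity of those sets, and the values of NUM_HASHES, BANDS, and the 0.6 threshold are arbitrary starting points. All function names here are hypothetical helpers, not part of any library.

```python
import hashlib
import math
from collections import defaultdict
from itertools import combinations

NUM_HASHES = 128                      # signature length (assumed value)
BANDS = 32                            # number of LSH bands (assumed value)
ROWS_PER_BAND = NUM_HASHES // BANDS   # 4 signature entries per band


def row_tokens(name, size, date, k=3):
    """Turn one row into a token set covering name, size, and date."""
    shingles = {name[i:i + k] for i in range(max(1, len(name) - k + 1))}
    size_bucket = "size~2^%d" % round(math.log2(max(size, 1)))
    day = "day=" + str(date)[:10]     # keep only YYYY-MM-DD
    return shingles | {size_bucket, day}


def _hash(seed, token):
    """Deterministic 64-bit hash of a token, salted with the seed."""
    data = f"{seed}:{token}".encode("utf-8")
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


def minhash(tokens):
    """MinHash signature: the minimum hash of the set under each seed."""
    return tuple(min(_hash(seed, t) for t in tokens) for seed in range(NUM_HASHES))


def jaccard(a, b):
    return len(a & b) / len(a | b)


def near_duplicates(rows, threshold=0.6):
    """rows: dicts with keys Id, Name, Size, Date. Returns {Id: [similar names]}."""
    rows = list(rows)
    token_sets = [row_tokens(r["Name"], r["Size"], r["Date"]) for r in rows]
    signatures = [minhash(t) for t in token_sets]

    # LSH banding: rows whose signatures agree on a whole band share a bucket.
    buckets = defaultdict(list)
    for idx, sig in enumerate(signatures):
        for b in range(BANDS):
            band = sig[b * ROWS_PER_BAND:(b + 1) * ROWS_PER_BAND]
            buckets[(b, band)].append(idx)

    # Only rows that collide in at least one bucket are compared.
    candidates = set()
    for members in buckets.values():
        candidates.update(combinations(members, 2))

    # Verify each candidate pair with the exact Jaccard similarity.
    result = defaultdict(list)
    for i, j in sorted(candidates):
        if jaccard(token_sets[i], token_sets[j]) >= threshold:
            result[rows[i]["Id"]].append(rows[j]["Name"])
            result[rows[j]["Id"]].append(rows[i]["Name"])
    return dict(result)
```

With a pandas dataframe that has the columns shown in the question (and a numeric Size column), the rows could be passed as `near_duplicates(df.to_dict("records"))`. Because the name contributes many more tokens than size or date, the name dominates the similarity in this sketch; if size and date should weigh more, the token scheme would need to change, and the threshold should be tuned against pairs you already consider duplicates.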

0 Answers