0

I got stuck with my MySQL statement. I have a huge database table (140.000 rows) with file pathes, MD5 hashes and code versions. Now I want get rows that can identify versions based on MD5 hashes of a specific file.

Something like these

Path                    MD5                                 Version
----------------------------------------------------------------------------
my/path/to/file         12B9DA49A62D8E7B21B5AA2C6C3126CB    1.2.0
my/path/to/file         12B9DA49A62D8E7B21B5AA2C6C3126CB    1.2.1
my/path/to/file         FDA9A7D31ECC70F99D533D88F9EDE441    1.2.2    
my/path/to/file         E68CC0233A1719A17CDF5D7CFB9539A8    1.2.3
my/path/to/file         E68CC0233A1719A17CDF5D7CFB9539A8    1.2.4

So version 1.2.2 can get identifed easily by file and hash since this MD5 value for the file is unique.

Now I need the SQL Statement for that query. I have tried with DISTINCT, with HAVING COUNT with GROUP BY. But no success at all. Actually I am at the beginning again with

SELECT DISTINCT
   md5
FROM 
  source_files
  INNER JOIN (
    SELECT
    *
    FROM
      source_files
  ) subq ON source_files.md5 = subq.md5

since I need all columns and not only the md5 column which get returned by SELECT DISTINCT md5 FROM source_files. But this also throws an error. Also I need to group by version since I just need a single combination / version to identify it.

Multi-File Check

Also the next level would be: If a version could not get identified by a unique MD5 hash since there is no unqiue MD5 hash for that version it should get "tested" against two files, or three files, or four... Since there is an unqiuie combination of files / MD5 hashes for a version.

Here an updated example for multi file checking

Path                    MD5                                 Version
----------------------------------------------------------------------------
my/path/to/file         12B9DA49A62D8E7B21B5AA2C6C3126CB    1.2.0
my/path/to/file         12B9DA49A62D8E7B21B5AA2C6C3126CB    1.2.1
my/path/to/file         FDA9A7D31ECC70F99D533D88F9EDE441    1.2.2    
my/path/to/file         E68CC0233A1719A17CDF5D7CFB9539A8    1.2.3    <- same file
my/path/to/file         E68CC0233A1719A17CDF5D7CFB9539A8    1.2.4    <- same file

my/path/to/file2        884F676DE41CB35951322B98654E046C    1.2.0    <- same file
my/path/to/file2        884F676DE41CB35951322B98654E046C    1.2.1    <- same file
my/path/to/file2        884F676DE41CB35951322B98654E046C    1.2.2    <- same file
my/path/to/file2        884F676DE41CB35951322B98654E046C    1.2.3    <- same file
my/path/to/file2        3182B2BEDDB1F798F66D27425B9F99D9    1.2.4
my/path/to/file2        3182B2BEDDB1F798F66D27425B9F99D9    1.2.5

So Version 1.2.3 could get identified by 4th "file" and 4th "file2"

Kevin Lieser
  • 951
  • 1
  • 9
  • 25

2 Answers2

1
SELECT * FROM source_files
INNER JOIN (
     SELECT MD5, COUNT(1) AS CountOf 
     FROM source_files 
     GROUP BY MD5
) tbl
ON source_files.MD5=tbl.MD5
WHERE tbl.CountOf = 1

Edited. Going beyond:

If a version could not get identified by a unique MD5 hash since there is no unqiue MD5 hash for that version it should get "tested" against two files, or three files, or four... Since there is an unqiuie combination of files / MD5 hashes for a version.

Then it looks like the PK of the source_files table are both fields path and MD5 combined. Details about tables with multiple columns as Primary Keys here or here.

In that case you can alter previous query and GROUP BY on multiple columns (details here):

SELECT * FROM source_files
INNER JOIN (
     SELECT MD5, COUNT(1) AS CountOf 
     FROM source_files 
     GROUP BY path, MD5
) tbl
ON source_files.MD5=tbl.MD5
WHERE tbl.CountOf = 1
Community
  • 1
  • 1
MarcM
  • 2,173
  • 22
  • 32
  • Working also... I figured out the following `SELECT * FROM (SELECT *, COUNT(md5) AS md5Check FROM source_files GROUP BY md5 HAVING md5Check = 1) as preFiltered GROUP BY version ORDER BY version` Thanks. Any idea for the multiple file check? – Kevin Lieser May 02 '17 at 12:21
  • Answer edited. Not sure wether that is your working scenerio. – MarcM May 02 '17 at 13:43
  • Mh. I don't really know...I need to fetch rows with pathes and md5 combinations that are only available for a single version. I think I need to pipe through all rows and create a sub query for a second file to check. Check updates example data above: Version 1.2.3. has no uniqe MD5 but can get identified by checking file one and file two. The result of my query should return the needed files to check to identify the version. So in this case `my/path/to/file E68CC0233A1719A17CDF5D7CFB9539A8 1.2.3`and `my/path/to/file2 884F676DE41CB35951322B98654E046C 1.2.3` – Kevin Lieser May 02 '17 at 15:56
0

You should have primary id

select
    ID, 
    PATH,
    MD5,
    Version
from source_files LEFT JOIN 
(
    select *, count(MD5) count from source_files group by MD5
) sf on sf.ID = source_files.ID having sf.count = 1
  • Seems to return all rows... Got it working with `SELECT * FROM (SELECT *, COUNT(md5) AS md5Check FROM source_files GROUP BY md5 HAVING md5Check = 1) as preFiltered GROUP BY version ORDER BY version` – Kevin Lieser May 02 '17 at 11:52