
After moving and backing up my photo collection a few times, I have several duplicate photos with different filenames, in various folders scattered across my PC. So I thought I would write a quick ColdFusion 9 page to find the duplicates (and I can then add code later to allow me to delete them).

I have a few questions:

  1. At the moment I am just using file size to match the image files, but I presume matching EXIF data or a hash of the image file binary would be more reliable?

  2. The code I lashed together sort of works, but how could this be done to search outside the web root?

  3. Is there a better way?


<cfdirectory 
name="myfiles" 
directory="C:\ColdFusion9\wwwroot\images\photos" 
filter="*.jpg"
recurse="true"
sort="size DESC"
type="file" >


<cfset matchingCount=0>
<cfset duplicatesFound=0>
<table border=1>
<cfloop query="myFiles" endrow="#myfiles.recordcount - 1#">

    <cfif myfiles.size is myfiles.size[currentrow + 1]>
        <!---this file is the same size as the next row--->
        <cfset matchingCount = matchingCount + 1>
        <cfset duplicatesFound=1>
    <cfelse>
        <!--- the next file is a different size --->

        <!--- if there have been matches, display them now ---> 
        <cfif matchingCount gt 0>   

            <cfset sRow = currentrow - matchingCount>
            <cfoutput><tr>
            <cfloop index="i" from="#sRow#" to="#currentrow#"> 
                    <cfset imgURL = replace(replace(directory[i], "C:\ColdFusion9\wwwroot\", "http://localhost:8500/"), "\", "/", "all")>
                    <td><a href="#imgURL#/#name[i]#"><img height="200" width="200" src="#imgURL#/#name[i]#"></a></td>
            </cfloop></tr><tr>
            <cfloop index="i" from="#sRow#" to="#currentrow#"> 
                <td width=200>#name[i]#<br>#directory[i]#</td>
            </cfloop>
            </tr>
            </cfoutput>

            <cfset matchingCount = 0>

        </cfif> 
    </cfif>
</cfloop>
</table>
<cfif duplicatesFound is 0><cfoutput>No duplicate jpgs found</cfoutput></cfif>
Saul

2 Answers


This is a pretty fun task, so I've decided to give it a try.

First, some test results on my laptop (4GB RAM, 2 x 2.26GHz CPU, SSD): 1,143 images, 263.8MB total.

ACF9: 8 duplicates, took ~2.3 s

Railo 3.3: 8 duplicates, took ~2.0 s (yay!)

I've used a great tip from this SO answer to pick the best hashing option.

So, here is what I did:

<cfsetting enablecfoutputonly="true" />

<cfset ticks = getTickCount() />

<!--- this is a great set of utils from Apache Commons Codec --->
<cfset digestUtils = CreateObject("java","org.apache.commons.codec.digest.DigestUtils") />

<!--- cache containers --->
<cfset checksums = {} />
<cfset duplicates = {} />

<cfdirectory
    action="list"
    name="images"
    directory="/home/trovich/images/"
    filter="*.png|*.jpg|*.jpeg|*.gif"
    recurse="true" />

<cfloop query="images">

    <!--- change the delimiter to \ if you're on Windows --->
    <cfset ipath = images.directory & "/" & images.name />

    <cffile action="readbinary" file="#ipath#" variable="binimage" />

    <!---
        This is slow as hell with any encoding!
        <cfset checksum = BinaryEncode(binimage, "Base64") />
     --->

    <cfset checksum = digestUtils.md5hex(binimage) />

    <cfif StructKeyExists(checksums, checksum)>

        <!--- init cache using original on 1st position when duplicate found --->
        <cfif NOT StructKeyExists(duplicates, checksum)>
            <cfset duplicates[checksum] = [] />
            <cfset ArrayAppend(duplicates[checksum], checksums[checksum]) />
        </cfif>

        <!--- append current duplicate --->
        <cfset ArrayAppend(duplicates[checksum], ipath) />

    <cfelse>

        <!--- save originals only into the cache --->
        <cfset checksums[checksum] = ipath />

    </cfif>

</cfloop>

<cfset time = NumberFormat((getTickcount()-ticks)/1000, "._") />


<!--- render duplicates without resizing (see options of cfimage for this) --->

<cfoutput>

<h1>Found #StructCount(duplicates)# duplicates, took ~#time# s</h1>

<cfloop collection="#duplicates#" item="checksum">
<p>
    <!--- display all found paths of duplicate --->
    <cfloop array="#duplicates[checksum]#" index="path">
        #HTMLEditFormat(path)#<br/>
    </cfloop>
    <!--- render only last duplicate, they are the same image any way --->
    <cfimage action="writeToBrowser" source="#path#" />
</p>
</cfloop>

</cfoutput>

Obviously, you can easily use the duplicates struct to review the results and/or run some cleanup job.

Have fun!

Sergey Galashyn
  • +1 likewise, that nails the hash-compare part, thanks Sergii. How could this be modified to cope with more than 2 copies of the same image/file? As it stands, if you have 3 copies of the same image, your code identifies 2 dupes. Also, how could you show an image preview outside of the web root? – Saul Jun 19 '11 at 09:27
  • @Saul Please check the updated code sample; I've reworked it a bit. All instances of the same image are grouped now, and I agree that this is handier for analysis. The image output is extremely simple, but the price for this is that images are displayed at original size -- this will be `very` slow for big images and/or a large number of images. – Sergey Galashyn Jun 19 '11 at 10:16
  • I was thinking.. wouldn't file IO and hashing at the OS level be faster and easier? – Henry Jun 19 '11 at 21:00
  • @Henry: agreed, but what's the right way to do OS level hash? Provide an example. – orangepips Jun 20 '11 at 01:27
  • @Henry It depends on what you want: solve the task for your personal needs and have fun with CFML (btw, I think hashing is very fast using the Apache lib, but rendering is not), or just solve the task as efficiently as possible. For the second purpose I'd use one of the great existing tools like [fslint](http://www.ubuntugeek.com/fslint-toolkit-to-fix-various-problems-with-filesystems-data.html) (for Ubuntu). – Sergey Galashyn Jun 20 '11 at 07:15
  • @Henry For hashing at the OS level I could try to invoke `md5sum` with `cfexecute` and pass a batch of files as the argument, but I think the question author uses Windows, so it would not work for him anyway. – Sergey Galashyn Jun 20 '11 at 07:18
  • @Henry This may work, but I simply don't have any interest in a Windows-specific solution. It is not useful to me in any way, at least for now. – Sergey Galashyn Jun 20 '11 at 11:04
0

I would recommend splitting the checking code up into a function that accepts only a filename.

Then use a global struct to check for duplicates: the key would be "size" or "size_hash", and the value could be an array containing all filenames that match this key.

Run the function on all JPEG files in all the different directories; after that, scan the struct and report all entries that have more than one file in their array.
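A minimal sketch of that approach (the `recordFile` function name and the `seen` struct are my own illustration; the hash reuses the Apache DigestUtils class from the answer above):

```cfm
<!--- global cache: key = "size_hash", value = array of matching file paths --->
<cfset seen = {} />
<cfset digestUtils = createObject("java", "org.apache.commons.codec.digest.DigestUtils") />

<cffunction name="recordFile" returntype="void" output="false">
    <cfargument name="filePath" type="string" required="true" />
    <!--- key on file size plus MD5 of the binary contents --->
    <cfset var key = getFileInfo(arguments.filePath).size
        & "_" & digestUtils.md5Hex(fileReadBinary(arguments.filePath)) />
    <cfif NOT structKeyExists(seen, key)>
        <cfset seen[key] = [] />
    </cfif>
    <cfset arrayAppend(seen[key], arguments.filePath) />
</cffunction>

<!--- call recordFile() for every jpg returned by cfdirectory, then: --->
<cfloop collection="#seen#" item="key">
    <cfif arrayLen(seen[key]) gt 1>
        <!--- every path in this group is a byte-for-byte duplicate --->
        <cfoutput>
        <p><cfloop array="#seen[key]#" index="path">#HTMLEditFormat(path)#<br/></cfloop></p>
        </cfoutput>
    </cfif>
</cfloop>
```

Checking the size first is cheap; including it in the key just means two files can only collide if both size and hash agree.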

If you want to show an image outside your web root, you can serve it via `<cfcontent file="#filename#" type="image/jpeg">`.
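For example, a small pass-through template could stream photos stored outside the web root (the `image.cfm` name, the `photoDir` value, and the `url.file` parameter are illustrative, not from the answer):

```cfm
<!--- image.cfm: streams a photo from a folder outside the web root --->
<cfparam name="url.file" type="string" />
<cfset photoDir = "C:\photos\" />
<!--- block path traversal so only files directly inside photoDir are served --->
<cfif find("..", url.file) or find("/", url.file) or find("\", url.file)>
    <cfabort showerror="Invalid file name" />
</cfif>
<cfcontent file="#photoDir##url.file#" type="image/jpeg" />
```

An `<img src="image.cfm?file=holiday.jpg">` tag would then display the preview without exposing the folder itself.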

jontro