I have searched extensively for a VBScript answer to this, but have given up and need help.

What I'm trying to accomplish is to find obviously duplicate files (obvious to humans, anyway) with different filenames. I need to delete the duplicates, keeping those WITHOUT the track number in the name. I also need to delete the M4A version of a song if I already have it in MP3.

Is this even possible? I have done a little VBScript work, but this is way beyond my limited programming ability. I'm not even going to bother copying the code I have tried here, because none of it works.

Here's a sample folder I'm trying to clean up. I want only the two unique songs in here to remain. I only want the MP3 version, and I don't want the track numbers in their names.

07 Falling In Love (Is Hard On The K.mp3
1-15 Love In An Elevator.m4a
1-15 Love In An Elevator.mp3
15 Love In An Elevator.mp3
2-07 Falling In Love (Is Hard On The.m4a
2-07 Falling In Love (Is Hard On The.mp3
Falling In Love (Is Hard On The Knees).mp3
Love In An Elevator.mp3

Thanks!

Ansgar Wiechers

1 Answer

This is no simple task, since you basically want to measure the similarity/proximity of different file names. My layman's approach would be to extract the title from the file name, normalize it, and then use the shortest left-based match for comparing them (that is, treat two files as the same song when the shorter normalized title is a prefix of the longer one). Something like this might work:

Set fso = CreateObject("Scripting.FileSystemObject")

Set re = New RegExp
re.Pattern = "^\d+(-\d+)?\s+"

Set rs = CreateObject("ADOR.Recordset")
rs.Fields.Append "NormalizedName", 200, 255  ' 200 = adVarChar, max. 255 chars
rs.Fields.Append "Length", 3                 ' 3 = adInteger
rs.Fields.Append "Path", 200, 255
rs.Open

' Store the full paths of the files and their associated normalized name in
' a disconnected recordset. The "Length" field is used for sorting (see below).
For Each f In fso.GetFolder("C:\some\folder").Files
  normalizedName = LCase(re.Replace(fso.GetBaseName(f.Name), ""))
  rs.AddNew
  rs("NormalizedName").Value = normalizedName
  rs("Length").Value = Len(normalizedName)
  rs("Path").Value = f.Path
  rs.Update
Next

' sort to ensure that the shortest normalized name always comes first
rs.Sort = "NormalizedName, Length ASC"

ref = ""
Set keeplist = CreateObject("Scripting.Dictionary")

rs.MoveFirst
Do Until rs.EOF
  path = rs("Path").Value
  name = rs("NormalizedName").Value
  currentExtension = LCase(fso.GetExtensionName(path))
  If ref <> "" And ref = Left(name, Len(ref)) Then
    ' same title as last file, so check if this one is a better match
    If extension <> "mp3" And currentExtension = "mp3" Then
      ' always pick MP3 version if it exists
      keeplist(ref) = path
      extension = currentExtension
    ElseIf extension = currentExtension _
        And IsNumeric(Left(fso.GetBaseName(keeplist(ref)), 1)) _
        And Not IsNumeric(Left(fso.GetBaseName(path), 1)) Then
      ' prefer file names not starting with a number when they have the
      ' same extension
      keeplist(ref) = path
    End If
  Else
    ' first file or different reference name
    ref = name
    extension = currentExtension
    keeplist.Add ref, path
  End If
  rs.MoveNext
Loop
rs.Close

For Each ref In keeplist
  WScript.Echo keeplist(ref)
Next
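
The loop above only prints the files to keep. As a rough sketch of the remaining step, assuming keeplist has been filled as shown and that only .mp3 and .m4a files should ever be removed, something like the following might delete the rest. DeleteUnlisted and its dryRun flag are made-up names for illustration, not part of the code above.

```vbscript
' Hypothetical helper: removes every .mp3/.m4a file in folderPath whose full
' path is not one of the values stored in keeplist.
' With dryRun = True it only prints what it would delete.
Sub DeleteUnlisted(fso, folderPath, keeplist, dryRun)
  Dim keepPaths, doomed, p, f, ext

  ' Case-insensitive lookup of the paths to keep.
  Set keepPaths = CreateObject("Scripting.Dictionary")
  keepPaths.CompareMode = vbTextCompare
  For Each p In keeplist.Items
    keepPaths(p) = True
  Next

  ' Collect the candidates first, so that nothing is deleted while the
  ' Files collection is still being enumerated.
  Set doomed = CreateObject("Scripting.Dictionary")
  For Each f In fso.GetFolder(folderPath).Files
    ext = LCase(fso.GetExtensionName(f.Name))
    If (ext = "mp3" Or ext = "m4a") And Not keepPaths.Exists(f.Path) Then
      doomed(f.Path) = True
    End If
  Next

  For Each p In doomed.Keys
    If dryRun Then
      WScript.Echo "Would delete: " & p
    Else
      fso.DeleteFile p, True   ' True also removes read-only files
    End If
  Next
End Sub

' Start with a dry run and check the output before deleting anything.
DeleteUnlisted fso, "C:\some\folder", keeplist, True
```

Note that this only deletes; the kept file still has whatever name it had before, so a track-number prefix on the only surviving copy of a song would have to be renamed away separately.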

I'm pretty sure that there are edge cases not covered by the above code, so handle with care. Also note that the code processes only a single folder. Processing a folder tree requires additional code (see here for instance).
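
For the folder-tree case, a minimal sketch of the usual recursive approach could look like this. WalkFolders and ProcessFolder are hypothetical names; ProcessFolder is only a placeholder for the per-folder logic shown earlier, which would have to be wrapped in a Sub and given a fresh ref and keeplist for each folder so that titles from different folders are not compared with each other.

```vbscript
' Hypothetical placeholder for the per-folder duplicate check.
Sub ProcessFolder(folder)
  WScript.Echo "Processing " & folder.Path
End Sub

' Calls ProcessFolder for the given folder, then recurses into each of its
' subfolders.
Sub WalkFolders(folder)
  Dim subfolder
  ProcessFolder folder
  For Each subfolder In folder.SubFolders
    WalkFolders subfolder
  Next
End Sub

Set fso = CreateObject("Scripting.FileSystemObject")
WalkFolders fso.GetFolder("C:\some\folder")
```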

Ansgar Wiechers
  • Sorry I haven't responded sooner, Ansgar, but I never received an email when someone answered my question, as I had requested. Anyway, THANKS. I'll try it out and let you know. – user2444243 Jul 27 '13 at 01:24
  • Wow, that created what looks like a great list (keeplist) of the ones I'll want to keep. I'll work on scripting a deletion routine for the unwanted ones, and let you know. – user2444243 Jul 27 '13 at 01:48