I have ~3,600 html files with a ton of image tags in them. I'd like to be able to capture all the src attribute values used in these files and aggregate them into a text file where I can then remove duplicates and see how many unique image filenames there are overall.

I use BBEdit and I can easily use regex and multi-file search to find all the image references (18,673), but I don't want to replace them with anything -- instead, I want to capture them from the BBEdit search results 'Notes' and push them into another file.

Is this something that can be AppleScripted? Or are there other means to the same end that would be appropriate?

  • The reason why I want to do this is I have a site with around 15,000 images in its image database of which a significant number are almost certainly redundant and need to be purged. I'd like to build a picture of where the redundancy is and its overall extent. – Jonathan Schofield Mar 14 '12 at 17:32

1 Answer

You've got a tall task there because there are several parts to solve: finding the files, reading each one, extracting the src values, removing duplicates, and writing the result out. To give you a start, here's some advice on reading one HTML file and putting all of its src values into an AppleScript list. You'll have to do much more than that, but it's a beginning.

First, you can read an HTML file into AppleScript as regular text. Something like this will get the text of one HTML file...

set theFile to choose file
set htmlText to read theFile

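Once you have the text in AppleScript, you can use text item delimiters to grab the src values. Here's an example. It picks up every double-quoted src attribute regardless of the markup around it, which also means it will catch script or iframe sources and will miss values written with single quotes or with spaces around the equals sign...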

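-- Sample HTML; in practice htmlText comes from reading a file as shown above.
set htmlText to "<img src=\"smiley.gif\" alt=\"Smiley face\" height=\"42\" width=\"42\" />
<img src=\"smiley.gif\" alt=\"Smiley face\" height=\"42\" width=\"42\" />
<img src=\"smiley.gif\" alt=\"Smiley face\" height=\"42\" width=\"42\" />"

-- Split on src=" so that every chunk after the first starts with an image path.
set AppleScript's text item delimiters to "src=\""
set a to text items of htmlText
if (count of a) is less than 2 then
    -- No src attributes found: restore the delimiters and return an empty list.
    set AppleScript's text item delimiters to ""
    return {}
end if

-- In each chunk, the path runs up to the next double quote.
set imageList to {}
set AppleScript's text item delimiters to "\""
repeat with i from 2 to count of a
    set thisImage to first text item of (item i of a)
    set end of imageList to thisImage
end repeat

-- Restore the default delimiters before returning.
set AppleScript's text item delimiters to ""
return imageList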
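
To go from one file to many, the same delimiter trick can be wrapped in a handler and run over a selection of files. What follows is only a rough sketch under a few assumptions: the handler name srcValuesIn, the file prompt, and the output file name image-srcs.txt on the Desktop are all made up for illustration, and the files are picked by hand in the file dialog rather than found by walking subfolders. It keeps only the first occurrence of each src value, compares values case-sensitively, and writes one value per line.

```applescript
-- Hypothetical handler: returns every value that follows src=" in htmlText.
on srcValuesIn(htmlText)
	set found to {}
	set AppleScript's text item delimiters to "src=\""
	set chunks to text items of htmlText
	set AppleScript's text item delimiters to "\""
	repeat with i from 2 to count of chunks
		set end of found to first text item of (item i of chunks)
	end repeat
	set AppleScript's text item delimiters to ""
	return found
end srcValuesIn

-- Pick the HTML files to scan (select as many as needed in the dialog).
set htmlFiles to choose file with prompt "Select the HTML files to scan" with multiple selections allowed

set uniqueImages to {}
repeat with oneFile in htmlFiles
	try
		-- Files that cannot be read are skipped by the try block.
		repeat with oneSrc in srcValuesIn(read (contents of oneFile))
			set srcText to contents of oneSrc
			-- Compare case-sensitively so Logo.gif and logo.gif stay distinct.
			considering case
				set isNew to uniqueImages does not contain srcText
			end considering
			if isNew then set end of uniqueImages to srcText
		end repeat
	end try
end repeat

-- Join the unique values with linefeeds, one per line.
set AppleScript's text item delimiters to linefeed
set outText to uniqueImages as text
set AppleScript's text item delimiters to ""

-- Assumed output location: image-srcs.txt on the Desktop (overwritten each run).
set outPath to (path to desktop as text) & "image-srcs.txt"
set fileRef to open for access file outPath with write permission
set eof fileRef to 0
write outText to fileRef as «class utf8»
close access fileRef

return count of uniqueImages
```

The script's result is the number of unique src values, and the text file can then be opened in BBEdit for a closer look. With thousands of files the list lookup gets slow, so it may be quicker to write every value out and remove the duplicates afterwards.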

I hope that helps!

regulus6633