1. Retrieve the page source: https://en.wikipedia.org/w/index.php?title=Google&action=raw
2. Scan it for substrings like
   [[File:Google web search.png|thumb|left|On February 14, 2012, Google updated its homepage with a minor twist. There are no red lines above the options in the black bar, and there is a tab space before the "+You". The sign-in button has also changed, it is no longer in the black bar, instead under it as a button.]]
3. Ask the API for all pictures on the page: http://en.wikipedia.org/w/api.php?action=query&titles=Google&generator=images&gimlimit=10&prop=imageinfo&iiprop=url|dimensions|mime&format=json
4. Discard every URL except those that match the picture names found in step 2.
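Steps 1 and 3 are plain HTTP requests. Here is a minimal Ruby sketch of how the two URLs could be built. The helper names (raw_source_uri, images_api_uri, fetch) are mine, not part of any library, and I'm assuming HTTPS for both endpoints.

```ruby
require "uri"
require "net/http"

# Step 1: URL of the raw wikitext of a page.
def raw_source_uri(title)
  query = URI.encode_www_form(title: title, action: "raw")
  URI("https://en.wikipedia.org/w/index.php?#{query}")
end

# Step 3: URL of the API query that lists the images used on a page.
# encode_www_form percent-encodes the "|" separators in iiprop.
def images_api_uri(title, limit: 10)
  query = URI.encode_www_form(
    action: "query", titles: title, generator: "images",
    gimlimit: limit, prop: "imageinfo",
    iiprop: "url|dimensions|mime", format: "json"
  )
  URI("https://en.wikipedia.org/w/api.php?#{query}")
end

# Performs the request and returns the response body.
# Deliberately not called here, so loading the sketch needs no network.
def fetch(uri)
  Net::HTTP.get(uri)
end
```

Something like fetch(raw_source_uri("Google")) would then give you the text to scan in step 2.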
Steps 2 and 4 need more explanation.
@2. The regexp /\b(File|Image):[^]|\n\r]+/ should be enough. In Ruby's regexps, \b denotes a word boundary, which might be unsupported in the language of your choice. The regexp I proposed matches all the cases I can think of: [[File:something.jpg]], gallery tags: <gallery>\nFile:one.jpg\nFile:two.jpg\n</gallery>, and templates: {{Infobox|pic = File:something.jpg}}. In the template case the match also picks up the closing }}, which the cleanup in step 4 removes. However, it won't correctly match filenames which contain ]. I'm not sure if they're legal, but if they are, they must be very uncommon, so it should not be a big deal.
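To make step 2 concrete, here is a small sketch that runs the regexp over a made-up snippet of wikitext. FILE_LINK, file_names and the sample text are illustrative names and data, not taken from a real page. I escaped the ] inside the character class, since some regexp engines treat an unescaped ] right after ^ differently.

```ruby
# Step 2: collect every File:/Image: name mentioned in the wikitext.
FILE_LINK = /\b(?:File|Image):[^\]|\n\r]+/

def file_names(wikitext)
  wikitext.scan(FILE_LINK)
          .map { |match| match.sub(/\A(?:File|Image):/, "").strip }
          .uniq
end

sample = <<~WIKI
  [[File:something.jpg]]
  <gallery>
  File:one.jpg
  File:two.jpg
  </gallery>
  {{Infobox|pic = File:three.jpg}}
WIKI

# The template entry comes back as "three.jpg}}": the character class
# does not exclude "}", so the trailing braces are kept at this stage.
p file_names(sample)
```

The leftover braces are harmless as long as names are stripped of non-alphanumeric characters before they are compared, as described for step 4.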
If you want to match only constructs like this: [[File:something.jpg|thumb|description]], the following regexp will work better: /\[\[(File|Image):[^]|]+/
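A sketch of the stricter variant, again with hypothetical names (DIRECT_LINK, direct_file_names). I moved the capture group onto the file name itself so that scan returns the names directly, which is a small change from the regexp as written above.

```ruby
# Matches only direct [[File:...]] / [[Image:...]] links.
# The file name is captured in group 1.
DIRECT_LINK = /\[\[(?:File|Image):([^\]|]+)/

def direct_file_names(wikitext)
  # scan returns one single-element array per match when the
  # pattern has a capture group, hence the flatten.
  wikitext.scan(DIRECT_LINK).flatten.map(&:strip).uniq
end

sample = <<~WIKI
  [[File:something.jpg|thumb|description]]
  {{Infobox|pic = File:icon.svg}}
  [[Image:other.png]]
WIKI

p direct_file_names(sample)
```

The file referenced from the template is skipped here, which is the point of this variant.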
@4. Before comparing names, I'd strip from them every character that matches /[^A-Za-z0-9]/. It's easier than escaping them and, in most cases, enough.
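Here is how step 4 might look. This is only a sketch: normalize and matching_urls are names I made up, and I'm assuming the API's JSON response has the shape query → pages → { id → { "title", "imageinfo" → [ { "url" } ] } }, so check that against what you actually get back.

```ruby
require "json"

# Keep only ASCII letters and digits, so that
# "Google web search.png" and "Google_web_search.png" compare equal.
def normalize(name)
  name.gsub(/[^A-Za-z0-9]/, "")
end

# api_json:     body returned by the API query from step 3
# wanted_names: file names collected from the page source in step 2
def matching_urls(api_json, wanted_names)
  wanted = wanted_names.map { |name| normalize(name) }
  pages = JSON.parse(api_json).dig("query", "pages") || {}  # assumed shape
  pages.values.map { |page|
    title = page["title"].to_s.sub(/\A(?:File|Image):/, "")
    info = (page["imageinfo"] || []).first
    info["url"] if info && wanted.include?(normalize(title))
  }.compact
end
```

Called as matching_urls(response_body, names_from_step_2), it returns only the URLs of pictures that the page source mentions by name.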
Icons are most often attached through templates, whereas pictures related to the article's subject are most often attached directly ([[File:…]]). There are exceptions, though: in some articles, pictures are attached with the {{Gallery}} template. There is also the <gallery> tag, which introduces a special syntax for galleries. You'll have to tune my solution to your needs, and even then it won't be perfect, but it should be good enough.