
I want to extract the full URLs of all images on the Wikipedia page for "Google".

I tried this API query:

http://en.wikipedia.org/w/api.php?action=query&titles=Google&generator=images&gimlimit=10&prop=imageinfo&iiprop=url|dimensions|mime&format=json

but this also returns images that are not related to Google, such as:

http://upload.wikimedia.org/wikipedia/en/a/a4/Flag_of_the_United_States.svg
http://upload.wikimedia.org/wikipedia/en/4/4a/Commons-logo.svg
http://upload.wikimedia.org/wikipedia/en/4/4a/Commons-logo.svg
http://upload.wikimedia.org/wikipedia/commons/f/fe/Crystal_Clear_app_browser.png

How can I extract only the images that I actually see on the Google page?

sparkle
  • But those images are on that page about Google, don't you see them? – Bergi Dec 17 '12 at 23:42
  • OK, I mean only the images in the boxes that the text wraps around – sparkle Dec 18 '12 at 00:04
  • For that, I think you will have to parse the source code of the page. – svick Dec 18 '12 at 00:22
  • It should be a usual thing. I wonder why the Wikipedia API doesn't provide it. It's Wikipedia, come on! – sparkle Dec 18 '12 at 00:30
  • No, you can't. It's the MediaWiki API, what did you expect? :-/ Image extraction is a complicated thing. Check out http://dbpedia.org – Bergi Dec 18 '12 at 00:49
  • @user1028100: "It should be a usual thing. (…)" No, it should not. To a reader, those pictures are obviously different things. But that difference comes from how people use the MediaWiki software, not from how the software handles those pictures. That's why the API can't tell which pictures relate to the article and which are just decorative icons. – skalee Dec 18 '12 at 19:39

1 Answer

  1. Retrieve the page's wikitext source: https://en.wikipedia.org/w/index.php?title=Google&action=raw
  2. Scan it for substrings like [[File:Google web search.png|thumb|left|On February 14, 2012, Google updated its homepage with a minor twist. There are no red lines above the options in the black bar, and there is a tab space before the "+You". The sign-in button has also changed, it is no longer in the black bar, instead under it as a button.]]
  3. Ask the API for all pictures on the page: http://en.wikipedia.org/w/api.php?action=query&titles=Google&generator=images&gimlimit=10&prop=imageinfo&iiprop=url|dimensions|mime&format=json
  4. Keep only the URLs whose file names match the picture names found in step 2, and discard the rest.
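
Steps 1 and 3 could be sketched in Python roughly as follows. This is only an illustration, not part of the original answer: the helper names (raw_source_url, images_api_url, image_urls, fetch) and the User-Agent string are made up for the example, and the response layout assumed in image_urls is the default format=json shape (query → pages → imageinfo). Note that gimlimit=10 returns at most ten files per request; the sketch asks for gimlimit=max instead and does not follow continuation for pages with even more files.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen

WIKI = "https://en.wikipedia.org/w"


def raw_source_url(title):
    """Step 1: URL of the page's raw wikitext."""
    return f"{WIKI}/index.php?" + urlencode({"title": title, "action": "raw"})


def images_api_url(title, limit="max"):
    """Step 3: URL of the API query listing the files used on the page."""
    params = {
        "action": "query",
        "titles": title,
        "generator": "images",
        "gimlimit": limit,
        "prop": "imageinfo",
        "iiprop": "url|dimensions|mime",
        "format": "json",
    }
    return f"{WIKI}/api.php?" + urlencode(params)


def image_urls(api_response):
    """Map each file title to its full URL, given a decoded API response.

    Assumes the default JSON layout: query -> pages -> {id: {title, imageinfo}}.
    """
    pages = api_response.get("query", {}).get("pages", {})
    return {
        page["title"]: page["imageinfo"][0]["url"]
        for page in pages.values()
        if page.get("imageinfo")
    }


def fetch(url):
    """Download a URL as text (hypothetical helper; needs network access)."""
    request = Request(url, headers={"User-Agent": "image-filter-sketch/0.1"})
    with urlopen(request) as response:
        return response.read().decode("utf-8")
```

With network access, wikitext = fetch(raw_source_url("Google")) and images = image_urls(json.loads(fetch(images_api_url("Google")))) would give the two inputs that steps 2 and 4 work on.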

Steps 2 and 4 need more explanation.

@2. The regexp /\b(File|Image):[^]|\n\r]+/ should be enough. In Ruby regexps, \b denotes a word boundary, which may be unsupported in the language of your choice. This regexp matches all the cases that come to my mind: direct links such as [[File:something.jpg]], gallery tags such as <gallery>\nFile:one.jpg\nFile:two.jpg\n</gallery>, and templates such as {{Infobox|pic = File:something.jpg}}. However, it cuts off file names that contain ]. I'm not sure whether such names are legal, but if they are, they must be very uncommon, so it should not be a big deal.

If you want to match only constructs like [[File:something.jpg|thumb|description]], the following regexp works better: /\[\[(File|Image):[^]|]+/
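
As a rough illustration of how the two patterns differ, here is a small Python sketch. The sample wikitext and the file_names helper are invented for the example. The ] inside the character class is escaped because some engines (JavaScript, for instance) read an unescaped [^] differently. In the template case the broad pattern also captures the closing }}, so the helper trims trailing braces; that trimming is an assumption added here, not part of the regexps themselves.

```python
import re

# The two patterns from the answer, with "]" escaped for portability.
ANY_FILE = re.compile(r"\b(?:File|Image):[^\]|\n\r]+")
DIRECT_ONLY = re.compile(r"\[\[(?:File|Image):[^\]|]+")

# Made-up wikitext covering a template, a direct link, and a gallery tag.
SAMPLE = (
    "{{Infobox|logo = File:Logo.svg}}\n"
    "[[File:Google web search.png|thumb|left|Caption]]\n"
    "<gallery>\nFile:one.jpg\nImage:two.jpg\n</gallery>\n"
)


def file_names(pattern, wikitext):
    """Return the matched names without the [[ and File:/Image: prefixes."""
    names = []
    for match in pattern.finditer(wikitext):
        # Drop leading brackets, then everything up to the first colon.
        name = match.group(0).lstrip("[").split(":", 1)[1]
        # Template matches end with "}}"; trim braces and spaces (assumption).
        names.append(name.strip(" }"))
    return names


print(file_names(ANY_FILE, SAMPLE))
print(file_names(DIRECT_ONLY, SAMPLE))
```

On this sample the broad pattern finds all four files, including the infobox logo, while the stricter one finds only Google web search.png.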

@4. Before comparing names, I'd remove every character that matches /[^A-Za-z0-9]/ from both the names found in step 2 and the names returned by the API. It's easier than escaping them and, in most cases, enough.
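
A minimal Python sketch of that comparison, assuming the API results are already available as a dictionary from file title to URL; normalize and keep_matching are hypothetical names chosen for the example. Comparing against the title field rather than the URL is a choice made here to avoid dealing with percent-encoding.

```python
import re


def normalize(name):
    """Drop the File:/Image: prefix, then every non-alphanumeric character."""
    name = re.sub(r"^(?:File|Image):", "", name.strip())
    return re.sub(r"[^A-Za-z0-9]", "", name)


def keep_matching(api_images, wanted_names):
    """Keep only the API entries whose title matches a name from the wikitext.

    api_images maps file titles (e.g. "File:Foo bar.png") to full URLs.
    wanted_names are the names collected from the page source in step 2.
    """
    wanted = {normalize(name) for name in wanted_names}
    return {
        title: url
        for title, url in api_images.items()
        if normalize(title) in wanted
    }
```

Because spaces, underscores, and stray braces all disappear during normalization, "Google_web search.png}}" and "File:Google web search.png" compare equal. The match is approximate: names made mostly of non-ASCII characters collapse to very short strings and can collide.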

Icons are most often attached through templates, whereas pictures related to the article's subject are most often attached directly ([[File:…]]). There are exceptions, though: some articles attach pictures with the {{Gallery}} template, and the <gallery> tag introduces its own syntax for galleries. You will have to tune this solution to your needs, and even then it won't be perfect, but it should be good enough.

skalee
  • Maybe in step 2, search just for `File:name.ext`? That way, even galleries and other templates would work. – svick Dec 18 '12 at 19:52
  • @svick: A picture name may contain spaces and dots (not only to denote the extension). I have no idea how to write a good regexp without checking the surrounding text. – skalee Dec 18 '12 at 20:05
  • @svick: Modified according to your suggestions. – skalee Jan 16 '13 at 08:44