6

I'm looking for a way to get all the movies on English Wikipedia, along with their creation dates. A movie, for my purposes, is a page with an IMDB ID attached to it. So, this is my query so far:

SELECT DISTINCT ?item_label ?imdb_id (YEAR(?dateCreation) AS ?AnneeCreation) WHERE {
  # items with an IMDb ID (P345) that is a title ID ("tt...")
  ?item wdt:P345 ?imdb_id .
  FILTER STRSTARTS(?imdb_id, "tt")

  # inception date (P571), where present
  OPTIONAL { ?item wdt:P571 ?dateCreation . }

  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }

  # only items with an article on English Wikipedia
  ?article schema:about ?item ;
           schema:isPartOf <https://en.wikipedia.org/> ;
           schema:name ?item_label .
}

The problem with this is that most of the pages don't have a P571 property, so I was wondering whether there is a better way to get the creation date, maybe via the revision history or something similar. I couldn't find such an option.

Any help will be appreciated!

Binyamin Even
  • `wdt:P571` is a property for the creation (inception) date of the entity itself, not the creation date of the corresponding Wikipedia article. As far as I can see, there is no such information in Wikidata; at least if you check with a `?article ?p ?o .` triple pattern, you won't get anything resembling that kind of metadata. – UninformedUser Jan 05 '20 at 12:09
  • The only place where I can see that information is via the Wikidata `bd:serviceParam wikibase:api "Generator" .` [service](https://www.mediawiki.org/wiki/API:Query#Generators). You could try to use this in the SPARQL query. – UninformedUser Jan 05 '20 at 12:26
  • at least you could get something like last touched date. just a sketch: `SELECT ?item ?pageid ?date WHERE { ?item wdt:P31 wd:Q11424 . ?item wdt:P345 $imdb_id. SERVICE wikibase:mwapi { bd:serviceParam wikibase:endpoint "en.wikipedia.org" . bd:serviceParam wikibase:api "Generator" . bd:serviceParam mwapi:generator "search" . bd:serviceParam mwapi:gsrsearch "hastemplate:\"infobox film\"" . bd:serviceParam mwapi:gsrlimit "max" .` – UninformedUser Jan 05 '20 at 12:40
  • `bd:serviceParam wikibase:limit 50 . bd:serviceParam mwapi:gcmprop "ids|title|type|touched" . ?item wikibase:apiOutputItem mwapi:item . ?pageid wikibase:apiOutput "@pageid" . ?ns wikibase:apiOutput "@ns" . ?date wikibase:apiOutput "@touched" . } } LIMIT 15` – UninformedUser Jan 05 '20 at 12:40
  • You won't find creation dates on Wikidata as it is a general data repository, not about Wikipedia articles. Are you sure only movies have imdb ids? The only way I can see to find creation dates programmatically is via https://en.wikipedia.org/w/api.php?action=query&prop=revisions&titles=Jaws%20(film)&rvlimit=1&rvprop=timestamp&rvdir=newer but that only works for a single title per call (despite "titles" the "rv" parameters only work when a single title is supplied). – smartse Jan 07 '20 at 01:11
  • 1
    More clarity about what you're trying to do, and potentially why, would help you get better answers. If you really do want Wikipedia page creation dates, this appears to be available only through each Wikipedia page's history page, which is not currently -- but could be made part of -- the DBpedia data extraction for each Wikipedia page. If you really want the creation date for the subject of the Wikipedia page -- i.e., the movie -- that's just a different Wikidata property. – TallTed Jan 07 '20 at 18:32
  • https://w.wiki/FZe – Stanislav Kralin Jan 16 '20 at 00:28

1 Answer

3

So, as the comments have noted, Wikidata properties (with some rare exceptions, such as featured-article flags) describe the underlying concept, not the Wikipedia page metadata. There is some limited ability to talk to the Wikipedia API, as @AKSW points out, but my understanding is that this doesn't work very well for large numbers of articles (note that the example code has a LIMIT 50 in it).

However, all is not lost! I worked out a methodology to do this at scale for very large numbers of articles recently in Gender and Deletion on Wikipedia, using a bit of lateral thinking.

First step: figure out your Wikidata query. tt-prefixed IMDb IDs may apply to things other than films (eg TV episodes, sports broadcasts), so another approach might be to do a P31/P279 type/class search to find all things that are "films, or subclasses of films". You will also want to add a filter that explicitly says "and only has an article in English Wikipedia", which I see you've already done. Note that this gives you the name of the WP article, not the "label" of the Wikidata item, which is distinct, so you can drop the (time-consuming) label service clause. You'll end up with something like https://w.wiki/FH4 (this still uses the tt- prefix approach and gets 180k results) or https://w.wiki/FH8 (the P31/P279 filter plus the tt- prefix, 136k results).
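As a rough illustration of that first step, here is a minimal sketch in Python that runs this kind of query against the Wikidata Query Service and saves the results as a TSV. The query here is an approximation of the linked ones (P31/P279* film filter plus the tt- prefix and the enwiki sitelink), not their exact text, and with ~100k+ results you may hit the service's timeout, in which case just run the query in the web UI and download the TSV from there.

# Sketch: fetch film items with IMDb title IDs and their enwiki article names.
# Uses the public WDQS endpoint; the User-Agent string is just a placeholder.
import csv
import requests

ENDPOINT = "https://query.wikidata.org/sparql"

QUERY = """
SELECT DISTINCT ?item ?imdb_id ?article_name WHERE {
  ?item wdt:P31/wdt:P279* wd:Q11424 .   # instance of film, or a subclass of film
  ?item wdt:P345 ?imdb_id .             # IMDb ID
  FILTER STRSTARTS(?imdb_id, "tt")      # title IDs only
  ?article schema:about ?item ;
           schema:isPartOf <https://en.wikipedia.org/> ;
           schema:name ?article_name .  # enwiki article title, not the Wikidata label
}
"""

resp = requests.get(
    ENDPOINT,
    params={"query": QUERY, "format": "json"},
    headers={"User-Agent": "film-creation-dates-sketch/0.1"},
)
resp.raise_for_status()
rows = resp.json()["results"]["bindings"]

with open("wikidata_films.tsv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f, delimiter="\t")
    writer.writerow(["item", "imdb_id", "article_name"])
    for r in rows:
        writer.writerow([r["item"]["value"], r["imdb_id"]["value"],
                         r["article_name"]["value"]])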

Run this query, save the results as a TSV somewhere, and move on to step 2. The tool we will use here is PetScan, which is designed to link up data from Wikipedia categories, Wikipedia metadata, Wikidata queries, and so on.

Feed the SPARQL query into tab 4 ("Other sources") and set "Use wiki" to enwiki at the bottom of that tab. This will force it to output data on the Wikipedia articles linked from this query.

Now hit "do it", wait a little while, (it took ~100s when I tested it) and examine the results. You will see that we get title (the WP article), page ID, namespace (hopefully always "(Article)", size in bytes, and last-touched date. None of these are creation date...

...except one of them kind of is. PageIDs are assigned sequentially, so they are essentially time-of-creation timestamps. There are some nuances here about edge cases - eg if I created a redirect called "Example (film)" in 2010, and in 2015 manually edited the redirect to become a real article called "Example (film)", it would show up as created in 2010. There may also be odd results for pages deleted and recreated, or ones that have had complicated page-move histories (straightforward page moves should maintain IDs, though). But, in general, for 95% of items, the pageID will reflect the time at which it was first created onwiki. For example, 431900000 was created at 11.14am on 1 July 2014; 531900000 was created at 6.29pm on 14 February 2017; and so on.
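If you do want an exact creation timestamp for an individual page (for example, to spot-check the pageIDs above), the first-revision timestamp is available from the standard MediaWiki API, as smartse's comment notes, one page per call. A minimal sketch, again in Python:

# Sketch: look up the first-revision (creation) timestamp for a page ID
# via the MediaWiki API. One page per call, so this is only practical
# for spot checks or modest numbers of pages.
import requests

API = "https://en.wikipedia.org/w/api.php"

def creation_timestamp(pageid):
    resp = requests.get(API, params={
        "action": "query",
        "prop": "revisions",
        "pageids": pageid,
        "rvlimit": 1,
        "rvprop": "timestamp",
        "rvdir": "newer",        # oldest revision first
        "format": "json",
        "formatversion": 2,
    }, headers={"User-Agent": "film-creation-dates-sketch/0.1"})
    resp.raise_for_status()
    page = resp.json()["query"]["pages"][0]
    return page["revisions"][0]["timestamp"]

print(creation_timestamp(431900000))  # should come back as 1 July 2014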

Back to PetScan - let's pull down all these items. In PetScan, go to the last tab and select TSV. Re-run the search and save the resulting file.

Now we have one TSV with Wikidata IDs, IMDB IDs, and WP page titles (plus anything else you want to recover from WD queries), and another with WP page titles and page IDs. You can link them together using the WP page titles, letting you go from "results in Wikidata" to "page ID". Clean these up and link them however you prefer - I did it in bash; you might want to use something more sensible like Python.
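For example, a minimal join sketch in Python, assuming the query results were saved as wikidata_films.tsv (as in the earlier sketch) and the PetScan export as petscan.tsv; the PetScan column names used here ("title" and "pageid") are assumptions, so check the header of your own download:

# Sketch: join the Wikidata TSV and the PetScan TSV on article title.
# Titles are normalised to spaces, since one file may use underscores
# and the other spaces.
import csv

def norm(title):
    return title.replace("_", " ").strip()

# title -> pageid from the PetScan export
pageids = {}
with open("petscan.tsv", encoding="utf-8") as f:
    for row in csv.DictReader(f, delimiter="\t"):
        pageids[norm(row["title"])] = row["pageid"]

# merge the page IDs onto the Wikidata rows
with open("wikidata_films.tsv", encoding="utf-8") as f_in, \
     open("films_with_pageids.tsv", "w", newline="", encoding="utf-8") as f_out:
    reader = csv.DictReader(f_in, delimiter="\t")
    writer = csv.writer(f_out, delimiter="\t")
    writer.writerow(["item", "imdb_id", "article_name", "pageid"])
    for row in reader:
        pid = pageids.get(norm(row["article_name"]))
        if pid is not None:
            writer.writerow([row["item"], row["imdb_id"],
                             row["article_name"], pid])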

Now you can convert pageID to creation date. For the work I did, I was only interested in six-month bins, so I just worked out an arbitrary pageID created on 1 January and 1 July each year and counted IDs between them. You could do the same thing, or use the API to look up individual pageIDs and get creation timestamps back; it depends on exactly what you want to get.
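Here's a rough sketch of the binning idea. The two boundary pageIDs below are just the examples quoted above; you would look up one boundary per bin edge you care about (for instance with the creation_timestamp() helper sketched earlier) and extend the table accordingly.

# Sketch: map a pageID to a coarse creation-date bin using known boundary
# pageIDs. The boundaries below are only the two illustrative values from
# the text; add one (pageID, date) pair per bin edge you need.
import bisect
from datetime import date

# (boundary pageID, date that pageID was created), in ascending pageID order
BOUNDARIES = [
    (431900000, date(2014, 7, 1)),
    (531900000, date(2017, 2, 14)),
]

IDS = [b[0] for b in BOUNDARIES]

def creation_bin(pageid):
    """Return the latest boundary date at or before this pageID,
    or None if the pageID predates all known boundaries."""
    i = bisect.bisect_right(IDS, int(pageid))
    return BOUNDARIES[i - 1][1] if i else None

print(creation_bin(500000000))  # falls in the bin starting 2014-07-01
print(creation_bin(100000000))  # None: before the earliest boundary here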

This is all a bit more complicated than just using the query service, and it will probably give spurious results for one or two articles with complicated histories, but it will basically let you do what you originally asked for.

Andrew is gone