I have a large (50K+ pages) Mediawiki wiki, and I need to efficiently get a list of all pages, sorted by last update time. I'm working in Python using pywikibot. The documentation hints that this is possible, but I haven't decoded how to do it yet. (I can download up to 500 pages easily enough.) Is there a reasonably efficient way to do this that's better than downloading batches of 500 in alphabetic order, getting update times page by page, and merging the batches?
2 Answers
MediaWiki does not directly expose a list of pages sorted by last edit time. You could just download all pages and sort them locally (in Python or in some kind of database, depending on how many pages there are):
import pywikibot

site = pywikibot.Site()
for namespace in site.namespaces:
    for page in site.allpages(namespace=namespace):
        # process page.title() and page.editTime(), e.g.:
        print(page.title(), page.editTime())
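To turn that into the sorted list the question asks for, you can collect (timestamp, title) pairs and sort locally once everything is downloaded. A minimal sketch of that idea (it assumes page.editTime() is available in your Pywikibot version; newer releases spell it page.latest_revision.timestamp):

import pywikibot

site = pywikibot.Site()
pages = []
for namespace in site.namespaces:
    for page in site.allpages(namespace=namespace):
        # editTime() may need an extra metadata request per page,
        # so the download is the slow part on a 50K-page wiki
        pages.append((page.editTime(), page.title()))

# pywikibot timestamps compare chronologically; reverse=True puts
# the most recently edited pages first
pages.sort(reverse=True)
for timestamp, title in pages:
    print(timestamp, title)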
Alternatively, use the allrevisions API, which can sort by time but returns all revisions of all pages; a query like `action=query&generator=allrevisions&prop=revisions` (issued with `pywikibot.data.api.QueryGenerator`) also returns the current revision of each page, so you can discard the old ones. Or use the SQL support in Pywikibot with a query like `SELECT page_namespace, page_title FROM page JOIN revision ON page_latest = rev_id ORDER BY rev_timestamp` (which will result in an inefficient filesort-based query, but for a small wiki that might not matter).
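For what it's worth, here is a rough sketch of the allrevisions route via QueryGenerator; the parameter names are the API's own (`garvlimit` is the generator-prefixed form of `arvlimit`), and it assumes `QueryGenerator` accepts a `parameters` dict the way `pywikibot.data.api.Request` does. It still walks every revision on the wiki, and the same page can be yielded again on later continuation batches, hence the de-duplication:

import pywikibot
from pywikibot.data import api

site = pywikibot.Site()
gen = api.QueryGenerator(
    site=site,
    parameters={
        'generator': 'allrevisions',
        'garvlimit': 'max',      # walk the revision table in large batches
        'prop': 'revisions',     # attaches the current revision of each generated page
        'rvprop': 'timestamp',
    })

latest = {}
for pagedata in gen:             # raw page dicts from the API response
    revs = pagedata.get('revisions')
    if not revs:
        continue
    ts = revs[0]['timestamp']    # ISO 8601 strings compare chronologically
    title = pagedata['title']
    if title not in latest or ts > latest[title]:
        latest[title] = ts

for title, ts in sorted(latest.items(), key=lambda kv: kv[1], reverse=True):
    print(ts, title)

And a similarly hedged sketch of the SQL route, using the `mysql_query()` helper from `pywikibot.data.mysql` (it needs database credentials in your Pywikibot user config, per the Manual:Pywikibot/MySQL page):

from pywikibot.data import mysql

# assumes DB access is configured for Pywikibot (Manual:Pywikibot/MySQL)
rows = mysql.mysql_query(
    'SELECT page_namespace, page_title, rev_timestamp '
    'FROM page JOIN revision ON page_latest = rev_id '
    'ORDER BY rev_timestamp DESC')
for page_namespace, page_title, rev_timestamp in rows:
    print(rev_timestamp, page_title)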

- Does that produce a list of pages sorted by time of last update? – Mark Olson Jan 20 '20 at 13:31
- Uh, sorry, I wasn't reading the question carefully. I don't think there's a way to do that (not even within MediaWiki; I don't think it has the right indexes for efficient sorting by last edit timestamp). The [allrevisions API](https://www.mediawiki.org/w/api.php?action=help&modules=query%2Ballrevisions) can sort by time, but it will give you all revisions in the wiki, not just the ones which are the latest revision of a page, so you'd have to do some kind of filtering... – Tgr Jan 20 '20 at 20:51
- ...plus it would take very long, since there are a lot more revisions than pages. So you are probably not any better off than with the approach described in the question. – Tgr Jan 20 '20 at 20:55
- If it's your own wiki, your best bet is probably using pywikibot's [SQL capabilities](https://www.mediawiki.org/wiki/Manual:Pywikibot/MySQL). For 50K pages query efficiency probably won't really matter. – Tgr Jan 20 '20 at 20:59
- Actually, the code I posted below does the job fast enough and the sorting is done on the downloaded data. The *only* issue seems to be how far back the change data goes, and on that I've found no hint of documentation. – Mark Olson Jan 20 '20 at 23:39
After some digging and a lot of experimenting, I found a solution using pywikibot which generates a list of all pages sorted by time of last update:
import pywikibot
from datetime import timedelta

wiki = pywikibot.Site()
current_time = wiki.server_time()
# Not for all time, just for the last 60 years...
iterator = wiki.recentchanges(start=current_time,
                              end=current_time - timedelta(hours=600000))
listOfAllWikiPages = []
for v in iterator:
    listOfAllWikiPages.append(v)

# This has an entry for each revision.
# Get rid of the older instances of each page by creating a dictionary which
# only contains the latest version.
temp = {}
for p in listOfAllWikiPages:
    if p["title"] in temp:
        if p["timestamp"] > temp[p["title"]]["timestamp"]:
            temp[p["title"]] = p
    else:
        temp[p["title"]] = p

# Recreate the listOfAllWikiPages from the de-duped dictionary
listOfAllWikiPages = list(temp.values())
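One small addition you may want: recentchanges() walks from start back to end, so the de-duped values already come out roughly newest-first, but an explicit sort makes the ordering independent of iteration details (the timestamps are ISO 8601 strings, so plain string comparison sorts chronologically):

# sort explicitly by last-edit timestamp, newest first
listOfAllWikiPages.sort(key=lambda p: p["timestamp"], reverse=True)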

- This will only list pages in the recentchanges table (typically pages changed in the last 30 days). – Tgr Jan 20 '20 at 06:10
- I'm not seeing that behavior. When I execute the code listed, the earliest change is 11/3/19, which is the date that the wiki was created. Can you help me by citing documentation of the age limitation? – Mark Olson Jan 20 '20 at 13:57
- The relevant configuration setting is [`$wgRCMaxAge`](https://www.mediawiki.org/wiki/Manual:$wgRCMaxAge). Maybe the wiki is misconfigured, and the periodic maintenance scripts which would purge the data do not run? – Tgr Jan 20 '20 at 20:41
- Or rather, the wiki's creation date is still within the retention period (which actually defaults to 90 days, not 30, I misremembered). – Tgr Jan 20 '20 at 20:56
- @Tgr I'd still love to get a reference to the documentation -- What I've found so far is both bare-bones and pretty cryptic. – Mark Olson Jan 20 '20 at 23:37
- That *was* a reference to the documentation. What, specifically, are you looking for? – Tgr Jan 21 '20 at 05:34