
Using Wikipedia's dumps I want to build a hierarchy for its categories. I have downloaded the main dump (enwiki-latest-pages-articles) and the category SQL dump (enwiki-latest-category). But I can't find the hierarchy information.

For example, the SQL categories' dump has entries for each category but I can't find anything about how they relate to each other.

The other dump (latest-pages-articles) says the parent categories for each page but in an unordered way. It just states all the parents.

I have seen wikiprep's category hierarchy (http://www.cs.technion.ac.il/~gabr/resources/code/wikiprep/)... How is that one constructed? Wikiprep lists the category ID, not its name. Is there a way to get the name for each ID?

fersarr

2 Answers


The category hierarchy information in MediaWiki is stored in the categorylinks table, so you're going to need the categorylinks dump.

You're also going to need the page (not pages-articles) dump for page id to title mapping.
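The join between the two tables can be sketched as follows. This is only a minimal illustration, assuming the rows have already been parsed out of the SQL dumps into tuples: `cl_from`, `cl_to` and `cl_type` are the `categorylinks` columns mentioned in the comments below, and `page_id`, `page_namespace` and `page_title` come from the `page` table. To my understanding, `page_title` is stored without the `Category:` prefix, and category pages are identified by `page_namespace = 14`. The function name and the sample rows are made up for the example (only `page_id` 12 for "Anarchism" comes from this thread).

```python
from collections import defaultdict

CATEGORY_NAMESPACE = 14  # MediaWiki namespace number for "Category:" pages


def build_category_edges(page_rows, categorylinks_rows):
    """Return {parent_category_title: set_of_child_category_titles}.

    page_rows: iterable of (page_id, page_namespace, page_title)
    categorylinks_rows: iterable of (cl_from, cl_to, cl_type)
    """
    # Map page_id -> title, keeping only pages that are themselves categories.
    category_title_by_id = {
        page_id: title
        for page_id, namespace, title in page_rows
        if namespace == CATEGORY_NAMESPACE
    }

    children = defaultdict(set)
    for cl_from, cl_to, cl_type in categorylinks_rows:
        # Only 'subcat' rows link a category to its parent category;
        # 'page' and 'file' rows link articles and files to a category.
        if cl_type != "subcat":
            continue
        child = category_title_by_id.get(cl_from)
        if child is not None:
            children[cl_to].add(child)
    return dict(children)


# Hypothetical sample rows (ids 100-102 are invented for illustration).
page_rows = [
    (12, 0, "Anarchism"),              # the article
    (100, 14, "Anarchism"),            # the category page
    (101, 14, "Anti-capitalism"),
    (102, 14, "Political_ideologies"),
]
categorylinks_rows = [
    (12, "Anarchism", "page"),
    (12, "Anti-capitalism", "page"),
    (100, "Anti-capitalism", "subcat"),
    (100, "Political_ideologies", "subcat"),
    (101, "Political_ideologies", "subcat"),
]

edges = build_category_edges(page_rows, categorylinks_rows)
```

Keeping titles as keys works because `cl_to` holds the parent category's title rather than an id; if you prefer ids throughout, you would need a second title-to-id lookup restricted to namespace 14.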

Nemo
svick
  • Thanks! Been looking for that all night! When you said "page" you mean this one enwiki-latest-page.sql.gz? (http://dumps.wikimedia.org/enwiki/latest/) – fersarr Jul 03 '13 at 09:09
  • @fersarr Yeah, that's the one. – svick Jul 03 '13 at 09:21
  • Sorry for bothering you again with this topic. I am working on it, but not getting the result I expected. Is this correct: from categorylinks I get the page ID and its categories. Some pages will also be categories, so connecting all the links should result in a hierarchy of categories? – fersarr Jul 10 '13 at 19:00
  • I'm trying to do the same thing, but maybe the categorylinks schema has changed. It no longer has a pageId column. categorylinks now has [cl_from, cl_to, cl_sortkey, cl_timestamp, cl_sortkey_prefix, cl_collation, cl_type]. How can I build the hierarchy from this? – kane Dec 03 '14 at 18:59
  • @kane If you look at [the documentation for `categorylinks`](https://www.mediawiki.org/wiki/Manual:Categorylinks_table), you'll see that the `page_id` is stored in the `cl_from` column (and always was). – svick Dec 03 '14 at 19:38
  • @svick thank you. So categorylinks.cl_from = page.page_id. Where do I go from there? Let's take, for example, page_id=12, which is "Anarchism". There are 19 categorylinks where cl_from=12. The cl_to is a text field [Anarchism, Anti-capitalism,Anti-fascism,...]. How do I find the parent or child categories/pages? – kane Dec 03 '14 at 19:51
  • @kane Find the `page_id` for the page with `page_title = 'Category:Anarchism'` and then look that up in `categorylinks` etc. – svick Dec 03 '14 at 19:53
  • @svick Am I supposed to look up the cl_to values in category.cat_title? The issue is that not all of the cl_to values are in category.cat_title. For example, there is a category.cat_title=Anarchism but none for Anti-capitalism. And it just seems odd that they didn't list the category.cat_id instead. – kane Dec 03 '14 at 19:55
  • @svick In my 34M page table, there is no page_title='Category:Anarchism'. There are 9 entries where page_title='Anarchism', however. Do I have the wrong page table, maybe? – kane Dec 03 '14 at 19:57
  • @svick In fact, I only have 12 entries in page where page_title LIKE 'Category:%' – kane Dec 03 '14 at 19:59
  • @kane 34MB? page.sql.gz for the English Wikipedia is 1 GB. Maybe you downloaded it for a different wiki? And you don't need the `category` table at all (notice that I never mentioned it), and especially not the `cat_id` (it's not related to `page_id`). – svick Dec 03 '14 at 20:03
  • Let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/66156/discussion-between-svick-and-kane). – svick Dec 03 '14 at 20:04
  • @fersarr: Could you kindly let me know how you created the hierarchy? – HHH Jul 15 '15 at 23:50
  • Is there a way to obtain all the subcategories and pages from a Category using those dumps, or do I need more sql tables? @svick – iamdeit Oct 23 '16 at 06:43
  • Since the OP has closed the chat, here's how you can determine which categories are hidden: "The status of hidden categories is stored in the page props table as the property "hiddencat" in pp_propname". P.S. Wikimedia has excellent descriptions of all its tables: https://www.mediawiki.org/wiki/Category:MediaWiki_database_tables – Shatu Feb 20 '17 at 06:52

Loading the category links dump and the related tables to build a Wikipedia hierarchy takes a very long time (even if it is interesting).

I found a faster path that gives good results: I rely on Wikipedia's vital articles hierarchy. See, for instance, sensimark for an example use.

amirouche