I'm trying to build the tree/graph of Wikipedia articles and their categories. What do I need to do that?
From this site (http://dumps.wikimedia.org/enwiki/latest/), I've downloaded:
- enwiki-latest-page.sql.gz
- enwiki-latest-categorylinks.sql.gz
- enwiki-20141106-category.sql.gz
I tried following the answer here (Wikipedia Category Hierarchy from dumps), but the categorylinks table doesn't seem to have the same schema (there is no pageId column).
What's the right way to build the hierarchy?
Bonus question: How can I tell which of the 35M pages in enwiki-latest-page.sql.gz are articles (supposedly about 5M, according to Wikipedia's statistics)?
Thanks