
I'm trying to build the tree graph of Wikipedia articles and their categories. What do I need in order to do that?

From this site (http://dumps.wikimedia.org/enwiki/latest/), I've downloaded:

  • enwiki-latest-page.sql.gz
  • enwiki-latest-categorylinks.sql.gz
  • enwiki-20141106-category.sql.gz

I tried following the answer here (Wikipedia Category Hierarchy from dumps), but categorylinks doesn't seem to have the same schema (there is no pageId column).

What's the right way to build the hierarchy?

Bonus question: How can I tell which of the 35M pages in enwiki-latest-page.sql.gz are articles (supposedly about 5M, according to Wikipedia's statistics)?

Thanks

kane
  • possible duplicate of [Wikipedia Category Hierarchy from dumps](http://stackoverflow.com/questions/17432254/wikipedia-category-hierarchy-from-dumps) – leo Dec 04 '14 at 10:02
  • You're absolutely right, @leo. I had a private chat with the answerer and summarized it in a way that's a little more detailed and will hopefully help others like me. – kane Dec 04 '14 at 18:45

2 Answers


Yes, it turns out that Stack Overflow answer was right. It referenced the right datasets, but I was too dense to understand how to relate them.

Thanks to @svick for leading me through the individual steps in a private chat.

For the benefit of others, I've explicitly detailed the relationship between the data sets and the exact steps to traverse the graph in my blog, which is a summary of our private chat.

Parsing Wikipedia Page Hierarchy
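
In short, the key point is that categorylinks does reference page IDs, just not under that name: in the MediaWiki schema these dumps use, cl_from holds the page_id of the member page, and cl_to holds the title of the parent category (underscores instead of spaces, no "Category:" prefix). Here is a minimal sketch of one level of the traversal, assuming the three dumps have been imported into MySQL; 'Computer_science' is only a hypothetical seed category.

    -- One level of the hierarchy below a seed category.
    -- cl_from is a page_id; cl_to is the parent category's title.
    SELECT p.page_title,
           cl.cl_type            -- 'subcat' = child category, 'page' = member page, 'file' = file
    FROM categorylinks AS cl
    JOIN page AS p ON p.page_id = cl.cl_from
    WHERE cl.cl_to = 'Computer_science';   -- hypothetical seed category

To go deeper, repeat the query with each returned title whose cl_type is 'subcat' (those category pages live in namespace 14); the blog post spells out the full traversal.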

kane
  • BTW, that chat isn't private, [anyone can see it](http://chat.stackoverflow.com/rooms/66156/discussion-between-svick-and-kane). – svick Dec 06 '14 at 03:45
  • Oh, that's good to know that my incompetence is public for the world to read :p But in all seriousness, thanks for your time. You were very helpful, and hopefully my blog did justice in describing concisely and accurately what you taught me, so you won't have to keep answering the same question again and again. – kane Dec 06 '14 at 08:00
  • We like to keep Stack Overflow relatively self-contained, so just referencing a blog which might disappear some day is discouraged. Can you summarize it here? – nealmcb Dec 23 '15 at 00:23
  • I found the key steps are summarized in another answer too: https://stackoverflow.com/a/21798259/6276743 –  Jul 25 '20 at 19:35

I ran into the same problem with the Japanese Wikipedia.

I solved this problem as follows:

  • Get the SQL dumps for category, categorylinks, and page, and import them into a MySQL server.
  • Run the following query, which returns the subcategories of '学問' (roughly, "academia"); a variant that lists article pages instead of subcategories is sketched after the output.
    MariaDB [wikipedia]> SELECT page.page_title
        -> FROM categorylinks
        -> JOIN page ON page.page_id = categorylinks.cl_from
        -> JOIN category ON categorylinks.cl_to = category.cat_title
        -> WHERE categorylinks.cl_type = 'subcat'
        -> AND category.cat_title LIKE '学問';
+-----------------------------------+
| page_title                        |
+-----------------------------------+
| 学問の分野                        |
| 科学                              |
| 学問スタブ                        |
| 架空の思想・学問                  |
| 学者                              |
| 学術出版                          |
| 学術称号                          |
| 学術団体                          |
| 学生                              |
| 学派                              |
| 学問の賞                          |
| 研究                              |
| 高等教育                          |
| 知識                              |
| 問題                              |
| ルネサンス・ユマニスム            |
+-----------------------------------+
16 rows in set (0.00 sec)
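
If you want the article pages that sit directly in a category rather than its subcategories, the same join can be filtered differently. This is only a sketch under the usual MediaWiki schema assumptions: cl_type = 'page' selects member pages, namespace 0 is the main (article) namespace, and page_is_redirect = 0 drops redirects.

    -- Articles directly under 学問 (one level only, redirects excluded)
    SELECT page.page_title
    FROM categorylinks
    JOIN page ON page.page_id = categorylinks.cl_from
    WHERE categorylinks.cl_to = '学問'
      AND categorylinks.cl_type = 'page'   -- member pages, not subcategories or files
      AND page.page_namespace = 0          -- main (article) namespace
      AND page.page_is_redirect = 0;       -- skip redirects

Roughly the same namespace-0, non-redirect filter on the page table is also how the "articles" can be told apart from the other rows in the page dump.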
niwatolli3
  • I wrote a builder that converts the Wikipedia SQL dumps into a category CSV for Japanese. I put the Dockerfile on Docker Hub: https://hub.docker.com/r/niwatolli3/wikipedia-category-csv/ – niwatolli3 Nov 20 '17 at 13:21