Creating a corpus of Wikipedia articles in a specific field

Question

I want to create a corpus of biology-related articles from Wikipedia so that I can analyze it later using NLP approaches. I have downloaded a Wikipedia dump and saved it in JSON format.

I am struggling to extract the biology-related articles. I was able to find all the articles listed under the category "Biology" using the method described here, but it turned out that only about 20 articles are listed directly under this category. I believe I would have better luck extracting all the articles that belong to the biology portal, but I don't know how to do that. Is there a method for extracting the articles that belong to a certain portal?

score 1 · Answer 1 · answered Jan 18 '18 at 06:21

Categories are nested. For example, "Animals" is probably a subcategory of "Biology".

You need to find all (transitive) subcategories first, then collect the documents.
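As a rough sketch of the first step, the transitive subcategories can be found with a breadth-first search. This assumes you have already built a mapping from each category to its direct subcategories; the `children` dictionary, the function name, and the toy data below are hypothetical and not taken from any particular dump format. The `seen` set keeps the search from looping if the category graph happens to contain a cycle, and `max_depth` is an optional cap in case deep descendants drift off topic.

```python
from collections import deque


def transitive_subcategories(root, children, max_depth=None):
    """Return root plus every category reachable from it via subcategory links."""
    seen = {root}
    queue = deque([(root, 0)])
    while queue:
        category, depth = queue.popleft()
        # Do not expand categories that already sit at the depth limit.
        if max_depth is not None and depth >= max_depth:
            continue
        for child in children.get(category, ()):
            if child not in seen:  # skip repeats; also protects against cycles
                seen.add(child)
                queue.append((child, depth + 1))
    return seen


# Toy mapping (hypothetical data): parent category -> direct subcategories.
children = {
    "Biology": ["Animals", "Botany"],
    "Animals": ["Mammals"],
    "Mammals": ["Animals"],  # deliberate cycle to show the search still terminates
}

biology_categories = transitive_subcategories("Biology", children)
shallow_categories = transitive_subcategories("Biology", children, max_depth=1)
```

The resulting set of category names can then be matched against the categories of each article in the dump.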

Has QUIT--Anony-Mousse

score 1 · Answer 2 · answered Jan 22 '18 at 20:52

Wikipedia's categories are organized as a DAG, so you can traverse the graph, treating the Biology category node as the root, and collect the associated Wikipedia articles. I did something similar before (with a different intent) and am sharing the GitHub repo here; it may help you.
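To illustrate the "collect the associated articles" part, here is a minimal sketch that is independent of the repo above. It assumes each record in your JSON dump is a dictionary with a `"title"` field and a `"text"` field holding the raw wikitext, including its `[[Category:...]]` links; those field names are assumptions, and if your JSON conversion stripped the markup this approach will not work as written. Categories that are only added through templates will also be missed.

```python
import re

# Matches [[Category:Name]] and [[Category:Name|sort key]] in raw wikitext.
CATEGORY_LINK = re.compile(
    r"\[\[\s*Category\s*:\s*([^\]|]+?)\s*(?:\|[^\]]*)?\]\]",
    re.IGNORECASE,
)


def categories_of(wikitext):
    """Return the set of category names linked from a page's wikitext."""
    return {name.replace("_", " ") for name in CATEGORY_LINK.findall(wikitext)}


def collect_articles(articles, wanted_categories):
    """Keep articles that belong to at least one of the wanted categories."""
    return [
        article
        for article in articles
        if categories_of(article["text"]) & wanted_categories
    ]


# Toy records (hypothetical data) standing in for entries of the JSON dump.
articles = [
    {"title": "Cell", "text": "A cell is the basic unit. [[Category:Cell biology]]"},
    {"title": "Piano", "text": "A piano is an instrument. [[Category:Keyboard instruments]]"},
    {"title": "Dog", "text": "The dog is a mammal. [[Category:Mammals|Dog]]"},
]

# In practice this set would be every category reachable from "Biology".
wanted = {"Cell biology", "Mammals"}
biology_articles = collect_articles(articles, wanted)
biology_titles = [article["title"] for article in biology_articles]
```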

Wasi Ahmad

Could you please explain more about your project? Does the final file at the end of this page: https://github.com/wasiahmad/mining_wikipedia/tree/master/WikiNomy contain all the categories in Wikipedia? I can't find the structure or how subcategories are determined. – user1419243 Feb 07 '18 at 14:02
