-2

I want to create a corpus of biology-related articles from Wikipedia so that I can analyze it later using NLP approaches. I have downloaded a Wikipedia dump and saved it in JSON format.

I am struggling to extract the biology-related articles. I was able to find all the articles listed under the category "Biology" using the method described here, but it turned out that only about 20 articles are listed directly under this category. I think I would have better luck extracting all the articles that belong to the biology portal, but I don't know how to do that. Is there a way to extract the articles that belong to a certain portal?

monte carlo

2 Answers

1

Categories are nested. For example, "Animals" is probably a subcategory of "Biology".

You need to find all (transitive) subcategories first, then collect the documents.
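
The two steps above can be sketched roughly as follows. This is only a minimal illustration: the `subcats` mapping (category to its direct subcategories) and the `article_categories` mapping (article title to its categories) are hypothetical structures that you would have to build from your own JSON dump, and the function names are made up for this example.

```python
from collections import deque


def transitive_subcategories(root, subcats):
    """Return root plus every category reachable from it via subcategory links."""
    seen = {root}
    queue = deque([root])
    while queue:
        current = queue.popleft()
        for child in subcats.get(current, ()):
            if child not in seen:  # skip categories already reached by another path
                seen.add(child)
                queue.append(child)
    return seen


def articles_in(categories, article_categories):
    """Return sorted titles of articles tagged with at least one given category."""
    return sorted(
        title
        for title, cats in article_categories.items()
        if categories.intersection(cats)
    )


# Toy data standing in for the real dump (hypothetical).
subcats = {
    "Biology": ["Animals", "Botany"],
    "Animals": ["Mammals"],
    "Botany": [],
    "Physics": ["Optics"],
}
article_categories = {
    "Dog": ["Mammals"],
    "Fern": ["Botany"],
    "Lens": ["Optics"],
}

bio_categories = transitive_subcategories("Biology", subcats)
bio_articles = articles_in(bio_categories, article_categories)
print(bio_articles)  # ['Dog', 'Fern']
```

"Dog" is picked up even though it is only tagged with "Mammals", because "Mammals" is reached through "Animals".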

Has QUIT--Anony-Mousse
1

Wikipedia's categories are organized as a DAG, so you can traverse the graph with the Biology category node as the root and collect the associated Wikipedia articles. I did a similar thing before (with a different intent) and am sharing the GitHub repo here; it may help you.
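
If the traversal ends up pulling in categories that are only loosely related to biology, one option is to cap how far below the root it goes. The sketch below is an assumption-laden illustration, not the code from the repo: `subcats` is a hypothetical mapping from a category to its direct subcategories, and `max_depth` is a cutoff you would have to tune by inspecting the results. It also tolerates a link that points back to an ancestor, in case the extracted data is not a clean DAG.

```python
def subcategories_within_depth(root, subcats, max_depth):
    """Map each category reachable from root in at most max_depth steps to its depth."""
    depth = {root: 0}
    frontier = [root]
    for level in range(1, max_depth + 1):
        next_frontier = []
        for category in frontier:
            for child in subcats.get(category, ()):
                if child not in depth:  # already seen, possibly via a back link
                    depth[child] = level
                    next_frontier.append(child)
        frontier = next_frontier
    return depth


# Toy data (hypothetical); the last entry links back to the root on purpose.
subcats = {
    "Biology": ["Animals"],
    "Animals": ["Animals in popular culture"],
    "Animals in popular culture": ["Films about animals", "Biology"],
}

shallow = subcategories_within_depth("Biology", subcats, max_depth=2)
deep = subcategories_within_depth("Biology", subcats, max_depth=10)
```

With `max_depth=2` the traversal stops before "Films about animals"; with a larger cutoff it is included at depth 3.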

Wasi Ahmad
  • Could you please explain more about your project? Does the final file at the end of this page: https://github.com/wasiahmad/mining_wikipedia/tree/master/WikiNomy contain all the categories in Wikipedia? I can't find the structure or how subcategories are determined. – user1419243 Feb 07 '18 at 14:02