I want to build a corpus of biology-related articles from Wikipedia so that I can analyze it later with NLP methods. I have downloaded a Wikipedia dump and saved it in JSON format.
I am struggling to extract the biology-related articles. I was able to find all the articles listed under the category "Biology" using the method described here, but it turned out that only about 20 articles are listed directly under this category. I believe I would have better luck extracting all the articles that belong to the biology portal, but I don't know how to do that. Is there a way to extract the articles that belong to a given portal?
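For context, here is a minimal sketch of the kind of direct category filter I mean. It is not necessarily the linked method. It assumes each article in my JSON dump is a dict with a `text` field holding the raw wikitext and a `title` field; those field names, the helper names, and the sample records are only illustrative.

```python
import re

# Matches category links in wikitext, e.g. [[Category:Biology]]
# or [[Category:Biology|sort key]]. Only the category name is captured.
CATEGORY_LINK = re.compile(
    r"\[\[\s*Category\s*:\s*([^\]|]+?)\s*(?:\|[^\]]*)?\]\]",
    re.IGNORECASE,
)


def categories_of(wikitext):
    """Return the set of category names linked from an article's wikitext."""
    return {name.strip() for name in CATEGORY_LINK.findall(wikitext)}


def articles_in_category(articles, category):
    """Yield articles tagged directly with `category` (no subcategories).

    Names are compared case-insensitively, which is a simplification.
    """
    wanted = category.casefold()
    for article in articles:
        # "text" is an assumed field name for the raw wikitext.
        if any(c.casefold() == wanted for c in categories_of(article["text"])):
            yield article


# Hypothetical records standing in for entries of the JSON dump.
sample = [
    {"title": "Biology", "text": "Biology is ... [[Category:Biology]]"},
    {"title": "Cell (biology)", "text": "... [[Category:Cell biology|*]]"},
    {"title": "Guitar", "text": "... [[Category:String instruments]]"},
]

direct = [a["title"] for a in articles_in_category(sample, "Biology")]
print(direct)
```

On the sample above, only the first record is returned: the "Cell (biology)" article sits in a subcategory and is missed, which is exactly the limitation I am running into.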