I am trying to get all Wikipedia articles in a category and its subcategories.
So far I have figured out only a small part of the problem, which is to use the Wikipedia (MediaWiki) API. For example, to list the members of Category:Geography, I request:
https://en.wikipedia.org/w/api.php?action=query&format=json&list=categorymembers&cmtitle=Category:Geography&cmlimit=100
This returns the following JSON response:
{
  "batchcomplete": "",
  "query": {
    "categorymembers": [
      {
        "pageid": 5883021,
        "ns": 14,
        "title": "Category:Branches of geography"
      },
      {
        "pageid": 5782300,
        "ns": 14,
        "title": "Category:Geography by place"
      },
      {
        "pageid": 8700702,
        "ns": 14,
        "title": "Category:Geography awards and competitions"
      },
      ...
    ]
  }
}
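As far as I understand, the "ns" field is the namespace number: 14 marks a subcategory and 0 marks an ordinary article. A minimal sketch of splitting one response on that field might look like the following. The function name split_members is my own, and the sample holds only the three entries shown above (the real response is longer).

```python
import json

# Only the three members visible in the truncated response above.
SAMPLE_RESPONSE = json.loads("""
{
  "batchcomplete": "",
  "query": {
    "categorymembers": [
      {"pageid": 5883021, "ns": 14, "title": "Category:Branches of geography"},
      {"pageid": 5782300, "ns": 14, "title": "Category:Geography by place"},
      {"pageid": 8700702, "ns": 14, "title": "Category:Geography awards and competitions"}
    ]
  }
}
""")


def split_members(response):
    """Return (subcategory_titles, article_titles) from one API response.

    Hypothetical helper: namespace 14 is treated as a subcategory and
    namespace 0 as an article; anything else (files, templates) is ignored.
    """
    members = response.get("query", {}).get("categorymembers", [])
    subcategories = [m["title"] for m in members if m["ns"] == 14]
    articles = [m["title"] for m in members if m["ns"] == 0]
    return subcategories, articles


subcategories, articles = split_members(SAMPLE_RESPONSE)
print(subcategories)
print(articles)
```

For the entries shown, everything lands in the subcategory list, which is exactly why a single request is not enough.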
Now my problem is how to use this in a Python script that collects all the articles. A further complication is that, for example, the first category, Branches of geography, itself contains more categories and subcategories. How do I write a script that traverses all the way down until it reaches the articles, saves them to a text file, and then moves back up the category tree to collect more?
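To make the question concrete, here is a rough sketch of the kind of traversal I have in mind, not a finished solution. It assumes the standard library only. The names fetch_members, collect_articles, save_titles and max_depth are hypothetical, and the User-Agent string is a placeholder. It uses an explicit stack instead of recursion, remembers visited categories because the category graph can contain cycles, and follows the API's "continue" values to page past the first 100 members.

```python
import json
import urllib.parse
import urllib.request

API_URL = "https://en.wikipedia.org/w/api.php"


def fetch_members(category):
    """Yield every member of one category, following API continuation."""
    params = {
        "action": "query",
        "format": "json",
        "list": "categorymembers",
        "cmtitle": category,
        "cmlimit": "100",
    }
    while True:
        url = API_URL + "?" + urllib.parse.urlencode(params)
        # Placeholder User-Agent (assumption); replace with your own details.
        request = urllib.request.Request(
            url, headers={"User-Agent": "category-crawler-example/0.1"}
        )
        with urllib.request.urlopen(request) as response:
            data = json.load(response)
        yield from data["query"]["categorymembers"]
        if "continue" not in data:
            break
        # Merge the continuation values into the next request.
        params.update(data["continue"])


def collect_articles(root, fetch=fetch_members, max_depth=3):
    """Depth-first walk from root; return article titles without duplicates."""
    seen_categories = {root}
    seen_articles = set()
    articles = []
    stack = [(root, 0)]
    while stack:
        category, depth = stack.pop()
        for member in fetch(category):
            title = member["title"]
            if member["ns"] == 14:
                # Subcategory: descend unless too deep or already visited.
                if depth < max_depth and title not in seen_categories:
                    seen_categories.add(title)
                    stack.append((title, depth + 1))
            elif member["ns"] == 0 and title not in seen_articles:
                seen_articles.add(title)
                articles.append(title)
    return articles


def save_titles(titles, path):
    """Write one title per line to a UTF-8 text file."""
    with open(path, "w", encoding="utf-8") as handle:
        for title in titles:
            handle.write(title + "\n")
```

Calling save_titles(collect_articles("Category:Geography", max_depth=2), "articles.txt") would then write the collected titles to a file. The depth limit is there because a broad category like Geography can fan out into a very large number of subcategories.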