I have searched around but haven't found much help. Here's my problem. I want to start from a portal page on Wikipedia, say Computer_science, and go to its category pages. Each category page lists some pages in that category plus links to subcategories. I will visit some of these pages and extract only the page abstracts, then move to the next level by following the subcategory links from the category page, and so on.
I know C++, PHP, JavaScript, and Python. Which fits best here? I'd like to get this done in a day. I know there's an API, but it doesn't seem helpful for getting content.
- Fetch the pages.
- Parse them to find the categories div (or the equivalent element in the raw wiki data), both to extract the abstracts and to follow links to other pages.
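
To make the plan concrete, here is a rough Python sketch of the traversal I have in mind. It is only a sketch under assumptions: `crawl_categories`, `fetch_members`, `fetch_abstract`, and `FAKE_TREE` are names I made up, and the two fetchers are passed in as plain functions so the walk can be tried on fake data without hitting Wikipedia at all.

```python
from collections import deque
from typing import Callable, Dict, List, Tuple

# (page titles, subcategory titles) listed on one category page.
Members = Tuple[List[str], List[str]]


def crawl_categories(
    start_category: str,
    fetch_members: Callable[[str], Members],
    fetch_abstract: Callable[[str], str],
    max_pages: int = 500,
    max_depth: int = 2,
) -> Dict[str, str]:
    """Breadth-first walk over a category tree, collecting page abstracts."""
    abstracts: Dict[str, str] = {}
    seen = {start_category}
    queue = deque([(start_category, 0)])
    while queue and len(abstracts) < max_pages:
        category, depth = queue.popleft()
        pages, subcats = fetch_members(category)
        for title in pages:
            if len(abstracts) >= max_pages:
                break  # stay under the page budget
            if title not in abstracts:
                abstracts[title] = fetch_abstract(title)
        if depth < max_depth:
            for sub in subcats:
                if sub not in seen:  # categories can link back to each other
                    seen.add(sub)
                    queue.append((sub, depth + 1))
    return abstracts


# Tiny in-memory stand-in for Wikipedia, for illustration only.
FAKE_TREE = {
    "Category:Computer_science": (["Algorithm"], ["Category:Algorithms"]),
    "Category:Algorithms": (
        ["Sorting algorithm", "Algorithm"],
        ["Category:Computer_science"],
    ),
}

result = crawl_categories(
    "Category:Computer_science",
    fetch_members=lambda c: FAKE_TREE.get(c, ([], [])),
    fetch_abstract=lambda t: f"Abstract of {t}",
)
print(sorted(result))
```

In a real run, the two fetchers would presumably wrap HTTP requests, possibly to the MediaWiki API (`list=categorymembers` for the members and `prop=extracts` with `exintro` for the abstracts), with a short pause between requests. I haven't verified that those calls return what I need, which is part of what I'm asking.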
I need suggestions for programming languages, libraries, and public code if available. I've also heard that Wikipedia doesn't like bot crawlers. I'm planning to fetch maybe 500 documents at most. Is that a problem?
Thanks a lot