I want to write a bash script or a small Java program that goes online, fetches the Wikipedia article for a query I submit, and converts it from HTML to a text file, a PDF, or any other readable format.
-
Use the [API](https://www.mediawiki.org/wiki/API:Main_page). Why scrape HTML when you don't need to? – Boris the Spider Nov 12 '17 at 09:44
-
How? Please kindly explain; I don't even know what an API is. – Olalekan Adebari Nov 12 '17 at 09:46
-
I have something for you to reap: https://en.wikipedia.org/wiki/Application_programming_interface . Also, this: https://stackoverflow.com/questions/627594/is-there-a-wikipedia-api – James Brown Nov 12 '17 at 09:54
-
OK, thanks. Now how do I get the API? – Olalekan Adebari Nov 12 '17 at 09:57
-
@OlalekanAdebari Boris provided you with a link to the API. You can read the documentation there, or simply search YouTube for "How to use WIKI API"; you will get plenty of examples. – Mark Davydov Nov 12 '17 at 10:05
1 Answer
This is web scraping. You can automate browser actions, and there are several libraries for this. In Java there is Jaunt (http://jaunt-api.com/jaunt-tutorial.htm); in Python there are webbrowser, Requests, Beautiful Soup and Selenium (https://automatetheboringstuff.com/chapter11/).
Wikipedia has a Download as PDF option in the left sidebar. You can automate a browser to click it and download the generated PDF.
In the Wikipedia page source, this is the ElectronPdf link in:
<li id="coll-create_a_book"><a href="/w/index.php?title=Special:Book&bookcmd=book_creator&referer=Web+scraping">Create a book</a></li><li id="coll-download-as-rdf2latex"><a href="/w/index.php?title=Special:ElectronPdf&page=Web+scraping&action=show-download-screen">Download as PDF</a></li><li id="t-print"><a href="/w/index.php?title=Web_scraping&printable=yes" title="Printable version of this page [p]" accesskey="p">Printable version</a></li>
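The links in that snippet follow a simple pattern, so for a given title they can be built directly instead of clicking through a browser. Here is a minimal sketch; the class `WikiLinks`, its method names and the `en.wikipedia.org` host are my own assumptions (the snippet only shows relative hrefs), and the `Special:ElectronPdf` link is copied from the snippet as it looked at the time, so it may not work on the current site.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

// Hypothetical helper: rebuilds the two sidebar links from the snippet for any title.
class WikiLinks {
    // Assumption: English Wikipedia; the snippet only contains relative hrefs.
    static final String BASE = "https://en.wikipedia.org/w/index.php";

    // "Printable version" link: spaces become underscores, as in title=Web_scraping.
    static String printableUrl(String title) {
        String t = URLEncoder.encode(title.trim().replace(' ', '_'), StandardCharsets.UTF_8);
        return BASE + "?title=" + t + "&printable=yes";
    }

    // "Download as PDF" screen: spaces become '+', as in page=Web+scraping.
    static String pdfScreenUrl(String title) {
        String p = URLEncoder.encode(title.trim(), StandardCharsets.UTF_8);
        return BASE + "?title=Special:ElectronPdf&page=" + p + "&action=show-download-screen";
    }

    public static void main(String[] args) {
        System.out.println(printableUrl("Web scraping"));
        System.out.println(pdfScreenUrl("Web scraping"));
    }
}
```

`URLEncoder.encode(String, Charset)` needs Java 10 or later; on older versions use the overload that takes the charset name as a string.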
If you just want the HTML of a Wikipedia page, you can simply do an HTTP GET, as described in "How do I do a HTTP GET in Java?"
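For the plain HTTP GET route, something like the following should work on Java 11 or later with the built-in `java.net.http` client. Treat it as a sketch: `WikiFetch`, `articleUrl`, `fetchHtml`, the `User-Agent` string and the output file name are names I made up for illustration, and it assumes the query is already the exact article title on English Wikipedia.

```java
import java.io.IOException;
import java.net.URI;
import java.net.URLEncoder;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical example class: downloads the HTML of one article and saves it to a file.
class WikiFetch {
    // Assumption: the query is the article title; spaces become underscores.
    static String articleUrl(String query) {
        String title = URLEncoder.encode(query.trim().replace(' ', '_'), StandardCharsets.UTF_8);
        return "https://en.wikipedia.org/wiki/" + title;
    }

    // Performs the HTTP GET and returns the response body as a string.
    static String fetchHtml(String url) throws IOException, InterruptedException {
        HttpClient client = HttpClient.newBuilder()
                .followRedirects(HttpClient.Redirect.NORMAL)
                .build();
        HttpRequest request = HttpRequest.newBuilder(URI.create(url))
                .header("User-Agent", "wiki-fetch-example/0.1") // made-up identifier
                .GET()
                .build();
        HttpResponse<String> response = client.send(request, HttpResponse.BodyHandlers.ofString());
        if (response.statusCode() != 200) {
            throw new IOException("Unexpected HTTP status " + response.statusCode() + " for " + url);
        }
        return response.body();
    }

    public static void main(String[] args) throws Exception {
        String query = args.length > 0 ? String.join(" ", args) : "Web scraping";
        String html = fetchHtml(articleUrl(query));
        Path out = Path.of(query.trim().replace(' ', '_') + ".html");
        Files.writeString(out, html); // writes UTF-8 by default
        System.out.println("Saved " + html.length() + " characters to " + out);
    }
}
```

Run it with `java WikiFetch.java Web scraping`. The saved file is still HTML; turning it into text or PDF is a separate step, for example with one of the libraries mentioned above.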
