Bash script to reap a Wikipedia article and convert it to a text file

Question

I want to write a bash script or a small Java program that goes online, fetches the Wikipedia article for a query I submit, and converts it from HTML to a text file, a PDF, or any other readable format.

Use the [API](https://www.mediawiki.org/wiki/API:Main_page). Why scrape HTML when you don't need to? — Boris the Spider, Nov 12 '17 at 09:44
How? Please explain; I don't even know what an API is. — Olalekan Adebari, Nov 12 '17 at 09:46
I have something for you to reap: https://en.wikipedia.org/wiki/Application_programming_interface . Also, this: https://stackoverflow.com/questions/627594/is-there-a-wikipedia-api — James Brown, Nov 12 '17 at 09:54
@OlalekanAdebari Boris provided you with a link to the API. You can read the documentation there or simply search YouTube for "How to use WIKI API"; you will get plenty of examples. — Mark Davydov, Nov 12 '17 at 10:05

score 0 · Answer 1 · answered Nov 12 '17 at 10:09

This is web scraping. You can automate browser actions, and there are several libraries for this: in Java there is Jaunt ( http://jaunt-api.com/jaunt-tutorial.htm ); in Python there are webbrowser, Requests, Beautiful Soup and Selenium ( https://automatetheboringstuff.com/chapter11/ ).

On Wikipedia there is a "Download as PDF" option in the left sidebar. You can automate a browser to click it and download the generated PDF.

In the Wikipedia page source, this link points to Special:ElectronPdf:

<li id="coll-create_a_book"><a href="/w/index.php?title=Special:Book&bookcmd=book_creator&referer=Web+scraping">Create a book</a></li>
<li id="coll-download-as-rdf2latex"><a href="/w/index.php?title=Special:ElectronPdf&page=Web+scraping&action=show-download-screen">Download as PDF</a></li>
<li id="t-print"><a href="/w/index.php?title=Web_scraping&printable=yes" title="Printable version of this page [p]" accesskey="p">Printable version</a></li>

If you just want the HTML of a Wikipedia page, you can simply issue an HTTP GET, as described in "How do I do a HTTP GET in Java?"
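
Since the question also asks for a shell version, here is a minimal sketch that combines the HTTP GET idea with the MediaWiki API suggested in the comments. It asks the API for a plain-text extract (prop=extracts with explaintext) instead of fetching HTML. It assumes curl and jq are installed; the function names article_filename and fetch_article are made up for illustration, and the file naming scheme is just one possible choice.

```shell
#!/bin/sh
# Sketch: save a Wikipedia article as plain text via the MediaWiki API.
# Assumes curl and jq are installed. Function names are illustrative only.

API_URL="https://en.wikipedia.org/w/api.php"

# Turn an article title into an output file name,
# replacing spaces and slashes with underscores.
article_filename() {
  printf '%s.txt' "$(printf '%s' "$1" | tr ' /' '__')"
}

# Fetch the plain-text extract of one article and write it to a file.
fetch_article() {
  title=$1
  out=$(article_filename "$title")
  # -G sends the --data-urlencode pairs as a URL-encoded query string (HTTP GET).
  curl -sSfG "$API_URL" \
    --data-urlencode "action=query" \
    --data-urlencode "prop=extracts" \
    --data-urlencode "explaintext=1" \
    --data-urlencode "redirects=1" \
    --data-urlencode "format=json" \
    --data-urlencode "titles=$title" |
    jq -r '.query.pages[].extract // empty' > "$out"
  # An empty file means the request failed or the article was not found.
  if [ -s "$out" ]; then
    echo "Saved $out"
  else
    echo "No text found for: $title" >&2
    return 1
  fi
}

# Run only when a title is given on the command line.
if [ "$#" -gt 0 ]; then
  fetch_article "$1"
fi
```

Saved as, say, wiki2txt.sh (a hypothetical name), it could be run as sh wiki2txt.sh "Web scraping", which should write the article text to Web_scraping.txt. Letting curl encode the parameters avoids hand-written URL escaping for titles with spaces or punctuation.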
