
I want to write a bash script or a small Java program that goes online, fetches the Wikipedia article for a query I submit, and converts it from HTML to a readable format such as a plain-text file or a PDF.


1 Answer


This is web scraping. You can automate browser actions, and there are several libraries for this. In Java there is Jaunt (http://jaunt-api.com/jaunt-tutorial.htm); in Python there are webbrowser, Requests, Beautiful Soup, and Selenium (https://automatetheboringstuff.com/chapter11/). A minimal fetch with Jaunt is sketched below.
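For illustration, a minimal Java sketch following the pattern shown in the Jaunt tutorial linked above; the class and method names (UserAgent, visit, doc.innerHTML) are taken from that tutorial and should be checked against it:

    import com.jaunt.JauntException;
    import com.jaunt.UserAgent;

    public class FetchArticle {
        public static void main(String[] args) {
            try {
                UserAgent userAgent = new UserAgent();          // Jaunt's headless client
                userAgent.visit("https://en.wikipedia.org/wiki/Web_scraping");
                System.out.println(userAgent.doc.innerHTML());  // raw HTML of the article page
            } catch (JauntException e) {
                System.err.println(e);
            }
        }
    }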

On Wikipedia there is a "Download as PDF" option in the sidebar on the left. You can automate a browser to click it and download the generated PDF.
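One way to automate that click, sketched with Selenium's Java bindings (the answer names Selenium for Python, but it is available for Java too; a Firefox/geckodriver setup is assumed):

    import org.openqa.selenium.By;
    import org.openqa.selenium.WebDriver;
    import org.openqa.selenium.firefox.FirefoxDriver;

    public class DownloadPdf {
        public static void main(String[] args) {
            WebDriver driver = new FirefoxDriver(); // requires geckodriver on the PATH
            try {
                driver.get("https://en.wikipedia.org/wiki/Web_scraping");
                // Click the sidebar "Download as PDF" link; the id comes from the
                // page source quoted below
                driver.findElement(By.cssSelector("#coll-download-as-rdf2latex a")).click();
                // This lands on the ElectronPdf download screen; a second click on
                // its download button would then save the generated PDF
            } finally {
                driver.quit();
            }
        }
    }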

In the Wikipedia page source this is handled by ElectronPdf:

    <li id="coll-create_a_book"><a href="/w/index.php?title=Special:Book&amp;bookcmd=book_creator&amp;referer=Web+scraping">Create a book</a></li>
    <li id="coll-download-as-rdf2latex"><a href="/w/index.php?title=Special:ElectronPdf&amp;page=Web+scraping&amp;action=show-download-screen">Download as PDF</a></li>
    <li id="t-print"><a href="/w/index.php?title=Web_scraping&amp;printable=yes" title="Printable version of this page [p]" accesskey="p">Printable version</a></li>
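Those hrefs can also be built directly for any article title, skipping the browser entirely; a small sketch with the standard library (the URL patterns are copied from the markup above, and URLEncoder's "+" for spaces matches the quoted links):

    import java.net.URLEncoder;
    import java.nio.charset.StandardCharsets;

    public class WikiUrls {
        public static void main(String[] args) {
            String title = "Web scraping"; // any article title
            String encoded = URLEncoder.encode(title, StandardCharsets.UTF_8); // "Web+scraping"
            // "Printable version" link from the sidebar markup above
            String printable = "https://en.wikipedia.org/w/index.php?title=" + encoded + "&printable=yes";
            // "Download as PDF" screen (Special:ElectronPdf) from the same markup
            String pdfScreen = "https://en.wikipedia.org/w/index.php?title=Special:ElectronPdf&page="
                    + encoded + "&action=show-download-screen";
            System.out.println(printable);
            System.out.println(pdfScreen);
        }
    }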

If you just want the HTML of a Wikipedia page, you can simply do an HTTP GET, as described in "How do I do a HTTP GET in Java?".
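For completeness, a self-contained GET using the standard java.net.http client (Java 11+); the User-Agent value is an arbitrary example:

    import java.net.URI;
    import java.net.http.HttpClient;
    import java.net.http.HttpRequest;
    import java.net.http.HttpResponse;

    public class WikiFetch {
        public static void main(String[] args) throws Exception {
            String title = args.length > 0 ? args[0] : "Web_scraping";
            HttpClient client = HttpClient.newHttpClient();
            HttpRequest request = HttpRequest.newBuilder()
                    .uri(URI.create("https://en.wikipedia.org/wiki/" + title))
                    .header("User-Agent", "WikiFetch/1.0 (example)")
                    .build();
            HttpResponse<String> response =
                    client.send(request, HttpResponse.BodyHandlers.ofString());
            System.out.println(response.body()); // raw article HTML, ready to convert
        }
    }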
