Use google-api/mediawiki-api to retrieve information

Question

I am currently working on a university project on the theme of search engines. For this purpose we were given access to a database of scientific publications (http://dblp.uni-trier.de).

It is a 2GB XML file which looks something like this:

<article key="GottlobSR96">
<author>Georg Gottlob</author>
<author>Michael Schrefl</author>
<author>Brigitte R&#246;ck</author>
<title>Extending Object-Oriented Systems with Roles.</title>
<pages>268-296</pages>
<year>1996</year>
<volume>14</volume>
<journal>TOIS</journal>
<number>3</number>
<url>db/journals/tois/tois14.html#GottlobSR96</url>
</article>

As you can see, the "article" tag contains various information such as the authors, the title of the paper, and the year of publication. My job now is to implement a Java solution that takes search terms of different categories (author, university, title) as input and provides the user with additional information.

For example, if you enter the name of a professor, it should return data such as his date of birth, the university he works at, the number of publications, etc.
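Before any enrichment, the author names and titles have to be read out of the file. Since it is about 2GB, a streaming parser is probably a better fit than loading the whole document into memory. Below is a minimal sketch using the JDK's StAX API. The class name `DblpArticleReader` and the callback shape are made up for illustration, and the sketch assumes the children of `<article>` contain plain text as in the sample above (nested markup or DTD-defined named entities in the full file are not handled).

```java
import java.io.Reader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Consumer;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical helper: streams <article> records and hands each one to a callback.
public class DblpArticleReader {

    /** Calls handler once per <article>, with child tag names mapped to their text values. */
    public static void readArticles(Reader source, Consumer<Map<String, List<String>>> handler) {
        try {
            XMLInputFactory factory = XMLInputFactory.newInstance();
            // Deliver text and character references (e.g. &#246;) as a single text event.
            factory.setProperty(XMLInputFactory.IS_COALESCING, Boolean.TRUE);
            XMLStreamReader xml = factory.createXMLStreamReader(source);

            Map<String, List<String>> article = null; // record being built, if any
            String field = null;                      // child element currently open
            StringBuilder text = new StringBuilder();

            while (xml.hasNext()) {
                switch (xml.next()) {
                    case XMLStreamConstants.START_ELEMENT:
                        if (xml.getLocalName().equals("article")) {
                            article = new LinkedHashMap<>();
                        } else if (article != null) {
                            field = xml.getLocalName();
                            text.setLength(0);
                        }
                        break;
                    case XMLStreamConstants.CHARACTERS:
                        if (field != null) {
                            text.append(xml.getText());
                        }
                        break;
                    case XMLStreamConstants.END_ELEMENT:
                        if (xml.getLocalName().equals("article")) {
                            handler.accept(article);  // record complete, nothing is kept afterwards
                            article = null;
                        } else if (article != null && field != null) {
                            // A tag such as <author> may repeat, so values are collected in a list.
                            article.computeIfAbsent(field, k -> new ArrayList<>())
                                   .add(text.toString().trim());
                            field = null;
                        }
                        break;
                    default:
                        break;
                }
            }
            xml.close();
        } catch (XMLStreamException e) {
            throw new IllegalStateException("Could not parse XML", e);
        }
    }

    public static void main(String[] args) {
        String sample = "<dblp><article key=\"GottlobSR96\">"
                + "<author>Georg Gottlob</author><author>Brigitte R&#246;ck</author>"
                + "<title>Extending Object-Oriented Systems with Roles.</title>"
                + "<year>1996</year></article></dblp>";
        readArticles(new StringReader(sample),
                a -> System.out.println(a.get("author") + " " + a.get("year")));
    }
}
```

For the real file, the `StringReader` would be replaced by a reader over the XML file, and the callback could count publications per author instead of printing.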

I suppose this would work by using the Google API to find a person's entry on the university homepage and then somehow parsing the page to find the needed information. For universities there should be a Wikipedia page.

I already tried using the MediaWiki API but couldn't figure out how to get only the specific information I want (I could only get the intro paragraph).

I've never worked on a project of this scale, so I don't really have a clue how to integrate third-party APIs, libraries, etc. into my own code. So I guess my question is:

How do I get specific information based on a Google search, whether through Wikipedia or otherwise?

Maybe solr might be something for your project: http://lucene.apache.org/solr/ Solr offers nice features like search, facets, filters, etc. — Frederic Klein, Nov 26 '17 at 20:11
It's not really clear what you are looking for - do you need to implement searching in this database that you have been given, or does that already work and you are just looking to enrich the results with external data? — Tgr, Nov 28 '17 at 17:30
@Tgr both need to be done. But we split the work and I am now tasked with the enrichment. — Stefan Watt, Nov 30 '17 at 05:38
https://stackoverflow.com/questions/33862336/how-to-extract-information-from-a-wikipedia-infobox has some relevant information. That said, I'd start by looking for a database (some kind of scientific publication data) where you can get reliable identifiers for the article and the authors: DOI, ORCID, something like that. Without that, a search (whether Google or Wikipedia/Wikidata) is unlikely to be useful. — Tgr, Nov 30 '17 at 05:44
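
To illustrate the infobox route mentioned in the last comment: the MediaWiki API can return the raw wikitext of a page's lead section (`action=parse` with `prop=wikitext` and `section=0`), which is where an infobox normally lives. The sketch below only builds that request URL and pulls simple `| key = value` lines out of wikitext that has already been fetched. The class and method names are hypothetical, the JSON response still has to be unwrapped with a JSON library of your choice, and the line-based regex is a rough assumption that will miss nested templates and multi-line values.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for reading simple infobox fields from Wikipedia wikitext.
public class InfoboxSketch {

    /** Builds a MediaWiki API URL that asks for the wikitext of the page's lead section. */
    public static String leadSectionUrl(String pageTitle) {
        return "https://en.wikipedia.org/w/api.php"
                + "?action=parse&format=json&prop=wikitext&section=0"
                + "&page=" + URLEncoder.encode(pageTitle, StandardCharsets.UTF_8);
    }

    // Matches single lines of the form "| established = 1815".
    private static final Pattern FIELD = Pattern.compile(
            "^[ \\t]*\\|[ \\t]*([^=|\\n]+?)[ \\t]*=[ \\t]*(.*?)[ \\t]*$",
            Pattern.MULTILINE);

    /** Collects non-empty "| key = value" lines; wiki markup in the values is left untouched. */
    public static Map<String, String> infoboxFields(String wikitext) {
        Map<String, String> fields = new LinkedHashMap<>();
        Matcher m = FIELD.matcher(wikitext);
        while (m.find()) {
            String value = m.group(2);
            if (!value.isEmpty()) {
                fields.putIfAbsent(m.group(1), value); // keep the first occurrence of a key
            }
        }
        return fields;
    }

    public static void main(String[] args) {
        // Made-up wikitext for illustration, not taken from a real page.
        String wikitext = "{{Infobox university\n"
                + "| name = Example University\n"
                + "| established = 1815\n"
                + "| city = [[Vienna]]\n"
                + "}}\n";
        System.out.println(leadSectionUrl("Example University"));
        System.out.println(infoboxFields(wikitext));
    }
}
```

Values come back with their wiki markup (for example `[[Vienna]]`), so some cleanup is still needed. If a reliable identifier for the person or institution is available, querying Wikidata for structured statements may be more robust than scraping infobox text.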
