Is it possible to create an input field where you can paste a Wikipedia page link and it will get all the text contents from that page?

I'm trying to add a feature to my web application where people can paste the URL of a Wikipedia page they want to analyze into an input field. The application will then use that URL to get all the text content from that page.

Suppose the user inputs this link: https://en.wikipedia.org/wiki/Taylor_Swift

The application will return the text content of that page, like this:

Taylor Alison Swift (born December 13, 1989) is an American singer-songwriter. Her narrative songwriting, which often centers around her personal life, has received widespread media coverage. Born in West Reading, Pennsylvania, Swift relocated to Nashville, Tennessee in 2004 to pursue a career in country music. At age 14, she became the youngest artist signed by the Sony/ATV Music publishing house, and at age 15, she signed her first record deal. Her 2006 eponymous debut studio album was the longest-charting album of the 2000s on the Billboard 200. Its third single, "Our Song", made her the youngest .......

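Also, I've tried this [api](https://en.wikipedia.org/w/api.php?format=json&action=query&prop=extracts&exintro&explaintext&redirects=1&titles=Taylor_Swift), which works, but it only returns the header (intro) content, not the whole page content.

I've gone through the Wikipedia API documentation and haven't found a way to do this (yet). Any suggestions on how to do this?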

Prottay Rudra
  • Does this answer your question? [How to get Wikipedia content using Wikipedia's API?](https://stackoverflow.com/questions/7185288/how-to-get-wikipedia-content-using-wikipedias-api) – ViktorG Aug 10 '20 at 18:00
  • https://en.wikipedia.org/api/rest_v1/#/Page%20content Found by going to Wikipedia in my language, click on Developers in the footer, Web APIs under Get Started, then read the page. – Heretic Monkey Aug 10 '20 at 18:01
  • I have edited with example demonstrating what I want – Prottay Rudra Aug 10 '20 at 18:03
  • Also, I've tried this [api](https://en.wikipedia.org/w/api.php?format=json&action=query&prop=extracts&exintro&explaintext&redirects=1&titles=Taylor_Swift), which works, but it just returns the header content, not the whole page content – Prottay Rudra Aug 10 '20 at 19:50
  • Refer to the API specs [here](https://en.wikipedia.org/api/rest_v1/#/Page%20content/get_page_html__title_), specifically the method `/page/html/{title}`, which returns the HTML for a title. Check out the instructions given for that method. The HTML you get in the response is the same as [this](https://en.wikipedia.org/api/rest_v1/page/html/Taylor_Swift?redirect=true), which is a rendered version of the response – Saiprasad Balasubramanian Aug 10 '20 at 20:10

2 Answers

Since you tagged node.js in your question, I'm assuming you are using JavaScript. You could use an npm library called wikijs.

An example from the wikijs page:

const wiki = require('wikijs').default;

// Query the Spanish Wikipedia, then log the page info
wiki({ apiUrl: 'https://es.wikipedia.org/w/api.php' })
    .page('Cristiano Ronaldo')
    .then(page => page.info())
    .then(console.log);

Hope this works for you

  • @ProttayRudra I checked out your edited question. If a user enters a link such as https://en.wikipedia.org/wiki/Taylor_Swift, you need to extract the page title from the URL and then query it using the official APIs, wikijs, or any other library. The response will be in a machine-readable format, not plain text. You'll need to clean the HTML response to get the text you need – Saiprasad Balasubramanian Aug 10 '20 at 18:13
  • Any idea how I can extract the page title from the URL? – Prottay Rudra Aug 10 '20 at 18:18
  • A very basic way would be to split the URL at the `wikipedia.org/wiki/` section, like this: `url.split('wikipedia.org/wiki/')[1]`. For the Taylor Swift link you mentioned, the output is `Taylor_Swift`. If the API doesn't consider this valid, replace every `_` with a space, like this: `url.split('wikipedia.org/wiki/')[1].replace(/_/g, ' ')` (a plain `.replace('_', ' ')` only replaces the first underscore) – Saiprasad Balasubramanian Aug 10 '20 at 18:25
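
The title extraction discussed in the comments above could also be done with the built-in `URL` class instead of string splitting. This is only a sketch: the function name `titleFromWikipediaUrl` is made up for illustration, and it assumes the link uses the usual `/wiki/<title>` path form.

```javascript
// Hypothetical helper (not part of any library): pulls the page title
// out of a Wikipedia article link, or returns null if it doesn't look like one.
function titleFromWikipediaUrl(link) {
  let url;
  try {
    url = new URL(link); // throws on malformed input
  } catch (err) {
    return null;
  }

  const isWikipedia =
    url.hostname === 'wikipedia.org' || url.hostname.endsWith('.wikipedia.org');
  const prefix = '/wiki/';
  if (!isWikipedia || !url.pathname.startsWith(prefix)) {
    return null;
  }

  // Using pathname drops any ?query or #fragment; decodeURIComponent turns
  // percent-escapes (e.g. %C3%A9) back into the original characters.
  return decodeURIComponent(url.pathname.slice(prefix.length));
}

console.log(titleFromWikipediaUrl('https://en.wikipedia.org/wiki/Taylor_Swift'));
```

Unlike a bare `split`, this ignores a trailing `#Section` anchor and rejects links that are not on a Wikipedia host.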
You can use this MediaWiki API to get the text of the article as plain text, without any formatting:

https://en.wikipedia.org/w/api.php?format=json&action=query&prop=extracts&exlimit=max&explaintext&titles=Taylor_Swift

It's actually the same API you mentioned in your question. The only difference is that you should remove the `&exintro` parameter, which limits the extract to the content before the first section, keep `&explaintext`, and add `&exlimit=max`.
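
As a rough sketch, the request above could be made from JavaScript like this. The function names (`buildExtractUrl`, `pickExtract`, `fetchArticleText`) are made up for illustration. The code assumes a runtime with a global `fetch` (a browser, or Node.js 18+), and it adds two parameters that are not in the URL above: `redirects=1`, taken from the URL in the question, and `origin=*`, which MediaWiki uses to allow anonymous cross-origin requests from a browser.

```javascript
// Builds the query URL described above: no exintro, with explaintext and exlimit=max.
function buildExtractUrl(title, lang = 'en') {
  const params = new URLSearchParams({
    format: 'json',
    action: 'query',
    prop: 'extracts',
    exlimit: 'max',
    explaintext: '1', // boolean flag: being present is what matters
    redirects: '1',   // assumption: follow redirects, as in the question's URL
    origin: '*',      // assumption: only needed for browser (CORS) requests
    titles: title,
  });
  return `https://${lang}.wikipedia.org/w/api.php?${params}`;
}

// The JSON response nests pages under query.pages, keyed by page id.
// Returns the first page's plain-text extract, or null if there is none.
function pickExtract(data) {
  const pages = (data && data.query && data.query.pages) || {};
  const first = Object.values(pages)[0];
  return first && typeof first.extract === 'string' ? first.extract : null;
}

// Fetches the plain text of one article by title.
async function fetchArticleText(title) {
  const response = await fetch(buildExtractUrl(title));
  if (!response.ok) {
    throw new Error(`Request failed with status ${response.status}`);
  }
  return pickExtract(await response.json());
}

// Usage (performs a network request):
// fetchArticleText('Taylor_Swift').then(text => console.log(text));
```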

ASammour