Getting the first paragraph of Wikipedia, and storing it into a text file

Question

I want to build a system where a search term is typed into the terminal of a Raspberry Pi and the Pi responds with voice output.

I've solved the text-to-speech conversion problem using pico TTS. Now what I wanted to do is go to the Wikipedia page of the term to be searched, and store the first paragraph of the page to a text file.

For example, the input Tiger should produce a text file containing the opening of the Simple English Wikipedia article:

The tiger (Panthera tigris) is a carnivorous mammal. It is the largest living member of the cat family, the Felidae. It lives in Asia, mainly India, Bhutan, China and Siberia.

I tried using this but it didn't seem to work.

Error message for

$ pip install wikipedia
...
Command /usr/bin/python -c "import setuptools, tokenize;__file__='/tmp/pip-build-qdTIZY/wikipedia/setup.py';exec(compile(getattr(tokenize, 'open', open)(__file__).read().replace('\r\n', '\n'), __file__, 'exec'))" install --record /tmp/pip-9CPD6D-record/install-record.txt --single-version-externally-managed --compile
failed with error code 1 in /tmp/pip-build-qdTIZY/wikipedia
Storing debug log for failure in /home/pi/.pip/pip.log

I tried using this http://stackoverflow.com/questions/4460921/extract-the-first-paragraph-from-a-wikipedia-article-python — Souvik Saha, Jun 10 '16 at 10:11
@SouvikSaha What went wrong? SO is not a free work force, nobody's going to write this program from scratch for you. — polkovnikov.ph, Jun 10 '16 at 10:45
For the top answer, I couldn't get either of the modules to be imported. And for the second answer, the pip command wasn't working. @polkovnikov.ph — Souvik Saha, Jun 10 '16 at 10:48
@SouvikSaha Please, update the post with the exact error messages. I bet it has something to do with Python version. — polkovnikov.ph, Jun 10 '16 at 10:49

score 0 · Answer 1 · edited May 23 '17 at 12:32

this seems to work:

title=Tiger
n_sentences=2
curl -s "http://simple.wikipedia.org/w/api.php?action=query&prop=extracts&titles=$title&exsentences=$n_sentences&explaintext=&format=json" |
  sed 's/.*"extract":"\|"}}}}$//g'

it correctly yields:

The tiger (Panthera tigris) is a carnivorous mammal. It is the largest living member of the cat family, the Felidae.

Also tested with title=Albert_Einstein:

Albert Einstein (14 March 1879 \u2013 18 April 1955) was a German-born theoretical physicist who developed the general theory of relativity, one of the two pillars of modern physics (alongside quantum mechanics).\nHe received the Nobel Prize in Physics in 1921, but not for relativity.

(Note that title="Albert Einstein", title=albert_einstein, and title=albert%20einstein all don't work, so you'll eventually want another command to find the best matching real simple.wikipedia article title.)
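One small part of the title problem can be handled locally: MediaWiki treats spaces and underscores in titles as equivalent, so spaces can be swapped for underscores before the URL is built. A minimal sketch, where normalize_title is a hypothetical helper name; it does nothing about capitalization, so albert_einstein would still need a real title lookup.

```shell
#!/bin/sh
# normalize_title is a hypothetical helper, not part of curl or the MediaWiki API.
# It only replaces spaces with underscores; letter case is left untouched.
normalize_title() {
  printf '%s\n' "$1" | tr ' ' '_'
}

# Example: title=$(normalize_title "Albert Einstein")
```

The result can then be assigned to title before running the curl command.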

the curl command makes an http request to simple.wikipedia.org. to see this in action, try this:

curl "http://simple.wikipedia.org/w/api.php?action=query&prop=extracts&titles=Tiger&exsentences=2&explaintext=&format=json"

the sed command then extracts the desired part of the response.
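To get from here to the text file the question asks for, the two steps can be wrapped in functions and the output redirected to a file. This is only a sketch: extract_text and fetch_extract are hypothetical names, the sed expression is split into two substitutions that should do the same job as the one above, and JSON escapes such as \u2013 and \n are left undecoded.

```shell
#!/bin/sh
# Hypothetical helpers; the names are illustrative only.

# Strip everything up to the start of the extract text, then the closing
# quote and braces at the end of the JSON response.
extract_text() {
  sed 's/.*"extract":"//; s/"}}}}$//'
}

# $1 = article title, $2 = number of sentences to request.
# The URL is quoted so the shell does not treat each & as "run in background".
fetch_extract() {
  curl -s "http://simple.wikipedia.org/w/api.php?action=query&prop=extracts&titles=$1&exsentences=$2&explaintext=&format=json" |
    extract_text
}

# Example (needs network access):
#   fetch_extract Tiger 2 > tiger.txt
```

After sourcing the script, fetch_extract Tiger 2 > tiger.txt should leave the extracted sentences in tiger.txt, ready to hand to the text-to-speech step.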

updated to increase chance of working with raspberry's curl & sed: changed https to http and rewrote sed command without -e.

ref:

MediaWiki API?

Can you please tell me how to use this exactly? I tried running it as a bash script and it doesn't seem to be giving an output @webb — Souvik Saha, Jun 11 '16 at 12:09
