Has anyone parsed Wiktionary?

Question

Wiktionary is a wiki dictionary that covers many languages. It even has translations. I would be interested in parsing it and playing with the data. Has anyone done anything like this before? Is there any library I can use? (Preferably Python.)

http://en.wiktionary.org/wiki/Wiktionary:Parsing – Katriel Jul 29 '10 at 15:39

score 23 · Answer 1 · answered Jul 29 '10 at 20:59

I once downloaded a Wiktionary dump to gather words and definitions for Slavic languages, and I used ElementTree to walk through the XML file that makes up the dump. I would avoid scraping or crawling the site and just download the XML dump that Wikimedia provides for Wiktionary. Go to the Wikimedia downloads page, look for the English Wiktionary dumps (enwiktionary), and pick the most recent one. You'll probably want the pages-articles.xml.bz2 file, which contains just the article content, with no history or comments. Parse it with whatever XML processing library you prefer in Python; I personally prefer ElementTree. Good luck.
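A minimal sketch of that approach, assuming the dump follows the usual MediaWiki export layout (`<page>` elements that contain a `<title>` and a `<text>`). The helper names `local_name`, `iter_pages` and `open_dump`, the inline `SAMPLE`, and its namespace URI are illustrative assumptions, not something taken from the dump documentation.

```python
import bz2
import io
import xml.etree.ElementTree as ET

# Tiny stand-in for a dump so the sketch runs without downloading anything.
# The namespace URI is an assumption; a real dump declares its own version.
SAMPLE = b"""<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.10/">
  <page>
    <title>feet</title>
    <revision><text>==English==\n===Noun===\n{{en-plural noun}}</text></revision>
  </page>
  <page>
    <title>voda</title>
    <revision><text>==Czech==\n===Noun===\n# water</text></revision>
  </page>
</mediawiki>"""


def local_name(tag):
    """Strip the '{namespace}' prefix ElementTree adds to namespaced tags."""
    return tag.rsplit("}", 1)[-1]


def iter_pages(stream):
    """Yield (title, wikitext) for each <page> in a MediaWiki XML export."""
    title, text = None, None
    # iterparse reads incrementally, so the whole file is never held at once.
    for _event, elem in ET.iterparse(stream, events=("end",)):
        name = local_name(elem.tag)
        if name == "title":
            title = elem.text
        elif name == "text":
            text = elem.text or ""
        elif name == "page":
            yield title, text
            title, text = None, None
            elem.clear()  # release the children of the finished page


def open_dump(path):
    """Open a .xml.bz2 dump as a binary stream (path is supplied by the caller)."""
    return bz2.open(path, "rb")


for page_title, wikitext in iter_pages(io.BytesIO(SAMPLE)):
    print(page_title, "->", wikitext.splitlines()[0])
```

For a real dump, `iter_pages(open_dump(path))` would replace the in-memory sample. Note that the `<text>` payload is still raw wikitext, so this only gets you as far as one string per entry.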

How did you use ElementTree? As far as I can see, most of the data is not XML-tagged, i.e., you get everything under: `==English== ===Etymology 1=== {{rfe}} ====Pronunciation==== * {{enPR|fēt}}, {{IPA|/fiːt/|lang=en}} * {{audio|en-us-feet.ogg|Audio (US)|lang=en}} * {{rhymes|iːt|lang=en}} * {{homophones|lang=en|feat}} ====Noun==== {{en-plural noun}}` – zadrozny Oct 28 '15 at 19:14
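One way to get further than the XML layer is to split the wikitext on its `==Heading==` lines and then look at the templates inside each section. The sketch below is only a rough illustration: `split_sections` and `template_names` are hypothetical helpers, the line breaks in `SAMPLE` are an assumed reconstruction of the excerpt in the comment above, and the regular expressions ignore nested templates and other wikitext edge cases.

```python
import re

# A heading line is the same run of 2-6 '=' on both sides of the title.
HEADING = re.compile(r"^(={2,6})[ \t]*(.+?)[ \t]*\1[ \t]*$", re.MULTILINE)
# The template name is whatever follows '{{' up to the first '|' or brace.
TEMPLATE = re.compile(r"\{\{([^{}|]+)")

SAMPLE = """==English==
===Etymology 1===
{{rfe}}
====Pronunciation====
* {{enPR|fēt}}, {{IPA|/fiːt/|lang=en}}
* {{audio|en-us-feet.ogg|Audio (US)|lang=en}}
====Noun====
{{en-plural noun}}
"""


def split_sections(wikitext):
    """Return (level, heading, body) tuples in document order."""
    matches = list(HEADING.finditer(wikitext))
    sections = []
    for i, match in enumerate(matches):
        # A section body runs until the next heading or the end of the text.
        end = matches[i + 1].start() if i + 1 < len(matches) else len(wikitext)
        body = wikitext[match.end():end].strip()
        sections.append((len(match.group(1)), match.group(2), body))
    return sections


def template_names(body):
    """List the names of the templates used in a section body."""
    return [name.strip() for name in TEMPLATE.findall(body)]


for level, heading, body in split_sections(SAMPLE):
    print("  " * (level - 2) + heading, template_names(body))
```

The heading level gives the nesting, so a caller can rebuild a tree of language, etymology and part-of-speech sections; interpreting each template is the part that varies from one Wiktionary edition to another.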

score 20 · Accepted Answer · answered Jul 29 '10 at 15:40

Wiktionary runs on MediaWiki, which has an API.

One of the subpages for the API documentation is Client code, which lists some Python libraries.
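For a single entry, the API can also be called without any client library. The sketch below uses only the standard library and the `action=parse` module with `prop=wikitext`; the helper names, the `User-Agent` string, and the assumption that `formatversion=2` returns the wikitext as a plain string under `parse.wikitext` should all be checked against the API documentation before relying on them.

```python
import json
import urllib.parse
import urllib.request

API_URL = "https://en.wiktionary.org/w/api.php"


def build_query(title):
    """Build the URL that asks the API for the raw wikitext of one page."""
    params = {
        "action": "parse",
        "page": title,
        "prop": "wikitext",
        "format": "json",
        "formatversion": "2",
    }
    return API_URL + "?" + urllib.parse.urlencode(params)


def extract_wikitext(response):
    """Pull the wikitext out of a decoded JSON reply (assumed reply shape)."""
    return response["parse"]["wikitext"]


def fetch_wikitext(title, user_agent="wiktionary-example/0.1 (hypothetical)"):
    """Fetch one entry over the network; the User-Agent value is a placeholder."""
    request = urllib.request.Request(
        build_query(title), headers={"User-Agent": user_agent}
    )
    with urllib.request.urlopen(request) as reply:
        return extract_wikitext(json.load(reply))


print(build_query("feet"))
```

Calling `fetch_wikitext("feet")` performs the actual request. For bulk work the dumps are the better fit, since fetching every entry through the API means one request per page.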

answered Jul 29 '10 at 15:40

Amber

score 15 · Answer 3 · answered Mar 16 '12 at 09:51

Wordnik has done a good job of parsing out definitions, etc., and they have a great API.

As the others have mentioned, Wiktionary is a formatting disaster and was not built to be computer-readable.

answered Mar 16 '12 at 09:51

spencercooly

Thanks, wordnik works perfectly for me. I have a [thin Python client](https://github.com/jabbalaci/jabbapylib/blob/master/jabbapylib/dictionary/wordnik.py) for getting definitions and examples for a word. – Jabba Mar 29 '12 at 09:36

Do you recognize that the dump from Wikimedia is intentionally partial? In fact, it is also maliciously partial in that the dump misses very basic and often-used words while containing a lot of words that many of us don't even know exist. – InformedA Jul 20 '16 at 11:18

@InformedA Link for "intentionally partial", please. If you found some page which is present on the wiki but not in the dumps, have you [reported the bug](https://phabricator.wikimedia.org/maniphest/task/edit/form/1/?projects=Dumps-Generation)? – Nemo Apr 28 '17 at 09:32

score 10 · Answer 4 · answered Feb 13 '16 at 19:17

Yes, many people have parsed Wiktionary. You can usually find past experiences in the Wiktionary-l mailing list archives.

A project not mentioned by other answers is DBpedia's Wiktionary RDF extraction.

Dozens of other research projects have parsed Wiktionary: you can find some examples in a recent Wiktionary special and in other issues of the Wikimedia research newsletter.

Recently someone also made an English Wiktionary REST API, which includes an unspecified subset of the Wiktionary data; its future plans are not yet known.

score 9 · Answer 5 · answered May 06 '11 at 04:52

I had a crack at parsing the German Wiktionary. I ended up writing it off as too difficult, but I put my (not at all tidied up) code up at https://github.com/benreynwar/wiktionary-parser before I gave up. Although the editors follow conventions, these are not enforced by anything other than peer oversight. The diversity of templates used, along with all the typos in the pages, makes the parsing quite challenging.

I think the problem is that they've used the same system as for Wikipedia, which is great for ease of use by the editors but is not appropriate for the much more structured content of Wiktionary. It's a shame, because if Wiktionary could be easily parsed it would be a hugely useful resource.

edited Dec 04 '15 at 05:27

answered May 06 '11 at 04:52

Ben Reynwar

Just saw this when looking at other Stack Overflow Wiktionary questions. It might be useful. http://en.wikipedia.org/wiki/Ubiquitous_Knowledge_Processing_Lab#Wiktionary_API – Ben Reynwar May 06 '11 at 04:57

This project is now hosted at https://github.com/benreynwar/wiktionary-parser. It remains neglected. – Ben Reynwar Oct 18 '13 at 00:15

score 4 · Answer 6 · answered Mar 13 '14 at 13:20

You are welcome to play with the parsed Wiktionary MySQL databases. There are two of them (English Wiktionary and Russian Wiktionary), created by a parser written in Java: http://wikokit.googlecode.com

If you like PHP, then you are welcome to play with piwidict, a PHP API to this machine-readable Wiktionary.

edited Dec 10 '14 at 09:49

answered Mar 13 '14 at 13:20

Andrew Krizhanovsky

This may be the most hopeful option of all written thus far. +1 – BlackVegetable Sep 06 '14 at 22:46

score 4 · Answer 7 · answered Mar 24 '12 at 23:05

I just made a word list from the German dump like this:

bzcat pages-articles.xml.bz2 | grep '<title>[^[:space:][:punct:]]*</title>' | sed 's:.*<title>\(.*\)</title>.*:\1:' > words

answered Mar 24 '12 at 23:05

benroth

I think the question was about parsing the wiki content, not the XML. – Quentin Pradet Oct 15 '13 at 12:28

score 3 · Answer 8 · answered Jul 29 '15 at 10:18

You may be interested in the DBnary project; it is not Python, but it is interesting. It claims to support parsing for 21 languages, and it powers WikDict.

answered Jul 29 '15 at 10:18

yota

WikDict also provides downloads of translation data that has been further processed to make it easier to use. See http://www.wikdict.com/page/about . – Karl Bartel Jan 22 '17 at 19:08

score 1 · Answer 9 · answered Nov 28 '14 at 21:12

There is also JWKTL, which does a good job of parsing and extracting structured data from Wiktionary. It is written in Java and supports the English, German and Russian editions.

edited Jun 17 '15 at 00:57

answered Nov 28 '14 at 21:12

Jan Berkel

I think it doesn't support French, but German – Chin Jun 17 '15 at 00:23

score 0 · Answer 10 · answered Jun 17 '15 at 00:22

It depends on how thoroughly you need to parse it. If you just need to get all the content for a word in one language (definition, etymology, pronunciation, conjugation, etc.), then it's pretty easy. I have done this before, although in Java using jsoup.

However, if you need to parse it down to different components of the content (e.g. just getting the definitions of a word), then it will be much more challenging. A Wiktionary entry for a word in a language has no pre-defined template, so a header can be anything from <h3> to <h6>, the order of the sections may be jumbled, they can be repetitive, etc.
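To illustrate the easy case in Python rather than Java, the sketch below collects the heading outline of a rendered entry with the standard-library `html.parser` and keeps the headings that belong to one language. It assumes that each language name sits in an `<h2>` and that everything up to the next `<h2>` belongs to it; `HeadingCollector`, `language_sections` and the simplified `SAMPLE_HTML` are hypothetical, and real pages carry extra markup (edit links, wrappers) that would need filtering.

```python
from html.parser import HTMLParser

# Simplified stand-in for a rendered entry; real markup is noisier.
SAMPLE_HTML = """
<h2><span class="mw-headline">English</span></h2>
<h3><span class="mw-headline">Etymology</span></h3>
<h4><span class="mw-headline">Noun</span></h4>
<h2><span class="mw-headline">German</span></h2>
<h3><span class="mw-headline">Noun</span></h3>
"""


class HeadingCollector(HTMLParser):
    """Record (level, text) for every <h2>..<h6> element that is fed in."""

    def __init__(self):
        super().__init__()
        self.headings = []
        self._level = None  # level of the heading currently open, if any
        self._parts = []

    def handle_starttag(self, tag, attrs):
        if len(tag) == 2 and tag[0] == "h" and tag[1] in "23456":
            self._level = int(tag[1])
            self._parts = []

    def handle_data(self, data):
        if self._level is not None:
            self._parts.append(data)

    def handle_endtag(self, tag):
        if self._level is not None and tag == f"h{self._level}":
            self.headings.append((self._level, "".join(self._parts).strip()))
            self._level = None


def language_sections(headings, language):
    """Return the sub-headings between the <h2> for `language` and the next <h2>."""
    inside = False
    result = []
    for level, text in headings:
        if level == 2:
            inside = text == language
        elif inside:
            result.append((level, text))
    return result


collector = HeadingCollector()
collector.feed(SAMPLE_HTML)
print(language_sections(collector.headings, "English"))
```

Because the sub-heading levels and their order are not fixed, the outline is returned as a flat list with levels rather than forced into a predefined template.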

score -1 · Answer 11 · answered May 19 '18 at 11:07

I wrote a primitive parser for the German Wiktionary dump in Java, without any dependencies, that only extracts nouns and their articles, plus their Arabic translations. Execution takes a long time, so be warned. If there’s interest in or need for parsing more or other data, please tell me; I might look into it as time permits.
