
How may Wiktionary's API be used to determine whether or not a word exists?

Armentage
  • Anyone who has read the documentation will see that the API contains nowhere near enough functionality to "retrieve Wiktionary word content"; I'd estimate it gets you roughly 1% of the way. You can retrieve raw wiki syntax or parsed HTML, and from there you have to do everything yourself. Having said that, there might be a very new experimental API that works only on the English Wiktionary. – hippietrail Aug 28 '16 at 06:28
  • Get all Wiktionary articles in individual JSON files here: https://github.com/dan1wang/jsonbook-builder – daniel Apr 13 '19 at 09:05
  • An even better parsed JSON version is here: https://kaikki.org/ – Pux May 31 '22 at 09:41

9 Answers

86

The Wiktionary API can be used to query whether or not a word exists.

Examples for existing and non-existing pages:

http://en.wiktionary.org/w/api.php?action=query&titles=test
http://en.wiktionary.org/w/api.php?action=query&titles=testx

The result page for the first link also documents other output formats (for example, format=json) that may be easier to parse.
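
For example, here is a minimal sketch in Python (assuming the third-party `requests` library; not part of the original answer) that asks for JSON output and treats a "missing" entry in the response as a non-existent page:

# Sketch: check page existence via the query API, using JSON output.
import requests

def page_exists(title):
    resp = requests.get(
        "https://en.wiktionary.org/w/api.php",
        params={"action": "query", "titles": title, "format": "json"},
    )
    pages = resp.json()["query"]["pages"]
    # Pages that do not exist are returned with a "missing" flag (and a negative page ID).
    return not any("missing" in page for page in pages.values())

print(page_exists("test"))   # True
print(page_exists("testx"))  # False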

To retrieve the word's data in a small XHTML format (should more than existence be required), request the printable version of the page:

http://en.wiktionary.org/w/index.php?title=test&printable=yes
http://en.wiktionary.org/w/index.php?title=testx&printable=yes

These can then be parsed with any standard XML parser.
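
A minimal sketch of that step (assuming `requests` and `lxml`; since the markup is not always strictly valid XHTML, as a comment below notes, a lenient HTML parser is used here):

# Sketch: fetch the printable page and print the definition lines.
import requests
from lxml import html

resp = requests.get(
    "https://en.wiktionary.org/w/index.php",
    params={"title": "test", "printable": "yes"},
)
doc = html.fromstring(resp.content)
# Definitions are rendered as ordered-list items in the page body.
for item in doc.xpath("//ol/li"):
    print(item.text_content().strip()[:100])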

Dave Jarvis
Michael Mrozek
  • Thanks; the API itself is not what I was hoping for, but the link you provided is what I was looking for. – Armentage May 14 '10 at 02:19
  • Now it accepts an additional format parameter for output other than XML, like so: https://en.wiktionary.org/w/api.php?action=query&titles=test&format=json – eenagy Jun 28 '15 at 18:07
  • Might not work as you expect though: https://en.wiktionary.org/wiki/Category:English_misspellings https://en.wiktionary.org/wiki/amatuer – endolith Apr 30 '16 at 04:15
  • Use `https://en.wiktionary.org/w/?curid=[page_id]&printable=yes` to redirect to the XHTML page using the `pageid`. – mie.ppa May 18 '19 at 08:53
  • If you need to fetch the data using the browser, you can use `https://en.wiktionary.org/w/api.php?format=json&action=query&origin=*&export&exportnowrap&titles=test` to avoid CORS-related problems. – David Airapetyan May 21 '19 at 00:49
  • How can this API be filtered to return only English words? – Nathan B Oct 27 '19 at 15:24
  • Use HTTPS with those examples; the HTTP versions aren't giving results. – adjwilli Aug 13 '20 at 14:03
  • Sadly, the printable XHTML seems poorly supported. There's a *no longer supported* warning shown. Also, I found that it gives me invalid XHTML, specifically an unclosed tag. Here's the URL I used: https://en.wiktionary.org/w/?curid=103410&printable=yes , alternatively: https://en.wiktionary.org/w/index.php?title=test&printable=yes – Max Barraclough Jan 08 '21 at 16:55
  • I've been playing with this myself. I think, if you want to check whether a word is valid in English, you want to use `https://en.wiktionary.org/w/api.php?action=query&format=xml&prop=categories&titles=`WORDS`%7C`TO`%7C`CHECK`&clcategories=Category%3AEnglish%20lemmas%7CCategory%3AEnglish%20non-lemma%20forms%7CCategory%3AEnglish%20eye%20dialect`. Then "valid in English" means a result that has the category "English lemmas" or "English non-lemma forms" but doesn't have the category "English eye dialect". However, the set of words meeting these criteria may still be overly broad for many uses. – j__m Apr 04 '21 at 22:14
36

There are a few caveats to simply checking that Wiktionary has a page with the name you are looking for:

Caveat #1: All Wiktionaries, including the English Wiktionary, actually have the goal of including every word in every language, so if you simply use the above API call, you will know that the word you are asking about is a word in at least one language, but not necessarily English: http://en.wiktionary.org/w/api.php?action=query&titles=dicare

Caveat #2: A redirect may exist from one word to another. It might be from an alternative spelling, but it might be from an error of some kind. The API call above will not differentiate between a redirect and an article: http://en.wiktionary.org/w/api.php?action=query&titles=profilemetry

Caveat #3: Some Wiktionaries, including the English Wiktionary, include "common misspellings": http://en.wiktionary.org/w/api.php?action=query&titles=fourty

Caveat #4: Some Wiktionaries allow stub entries which have little or no information about the term. This used to be common on several Wiktionaries but not the English Wiktionary; it now seems to have spread to the English Wiktionary as well: https://en.wiktionary.org/wiki/%E6%99%B6%E7%90%83 (permalink for when the stub is filled in, so you can still see what a stub looks like: https://en.wiktionary.org/w/index.php?title=%E6%99%B6%E7%90%83&oldid=39757161)

If these are not included in what you want, you will have to load and parse the wikitext itself, which is not a trivial task.
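
As an illustration of caveats #1 and #2, here is a minimal sketch in Python (assuming the `requests` library; not part of the original answer) that fetches the raw wikitext and only accepts pages that contain an English section and are not redirects:

# Sketch: accept a title only if its page exists, is not a redirect,
# and has an "==English==" section (a simple substring check).
import requests

def has_english_entry(title):
    resp = requests.get(
        "https://en.wiktionary.org/w/api.php",
        params={
            "action": "query",
            "titles": title,
            "prop": "revisions",
            "rvprop": "content",
            "rvslots": "main",
            "format": "json",
            "formatversion": "2",
        },
    )
    page = resp.json()["query"]["pages"][0]
    if page.get("missing"):
        return False
    wikitext = page["revisions"][0]["slots"]["main"]["content"]
    if wikitext.lstrip().lower().startswith("#redirect"):
        return False                    # caveat #2: redirects are not entries
    return "==English==" in wikitext    # caveat #1: require an English section

print(has_english_entry("dicare"))  # Latin entry only -> False
print(has_english_entry("test"))    # True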

hippietrail
  • What I really wanted to do was take a full dump of the data from one of the non-English Wiktionary sites and then turn the contents into something I could use locally. It seems silly now, but I was hoping that I could request the list of all words and then pull down their definitions/translations one at a time as needed. – Armentage Dec 05 '10 at 17:51
  • The fix to Caveat #2 is simple: add `&prop=info` to the query and check the response for the `redirect` attribute. – svick Apr 30 '12 at 11:17
  • @svick: Yes it's true #2 is easier to circumvent when using the API but these basic caveats also cover trying to parse the [Wiktionary data dump files](http://dumps.wikimedia.org/enwiktionary/), even though this question doesn't ask about that approach. – hippietrail Apr 30 '12 at 11:26
24

You can download a dump of Wiktionary data. There's more information in the FAQ. For your purposes, the definitions dump is probably a better choice than the XML dump.

Peter Mortensen
kybernetikos
  • Those dump files are massive, and it's unclear which ones to download (all of them?). Probably not what most people are looking for if they just want to programmatically look up a handful of words. – Cerin Jun 14 '12 at 18:25
  • I explain which file to download - i.e. the definitions dump (the directory from my link is just different versions of the same file) - and yes, if you programmatically want to look up words this is ideal. If you can guarantee the program will be executed only online, there are other options, but nevertheless I'm answering this part of the original question: "Alternatively, is there any way I can pull down the dictionary data that backs a Wiktionary?" – kybernetikos Jun 19 '12 at 20:18
  • The definitions dump link is no longer available. – live-love Aug 11 '15 at 16:51
13

To keep it really simple, extract the words from the dump like this:

bzcat pages-articles.xml.bz2 | grep '<title>[^[:space:][:punct:]]*</title>' | sed 's:.*<title>\(.*\)</title>.*:\1:' > words
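
This keeps only `<title>` lines whose text contains no whitespace or punctuation (i.e. single-word entries) and strips the tags, leaving one word per line.
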
Peter Mortensen
benroth
  • How do I get a copy of pages-articles.xml.bz2? – Armentage Apr 10 '12 at 13:27
  • It's just a generic name I used to describe the dumps of the form `LANGwiktionary-DATE-pages-articles.xml.bz2`. Go to [link](http://dumps.wikimedia.org/backup-index.html), then click `LANGwiktionary` (LANG e.g. 'en', 'de'...). – benroth Apr 11 '12 at 07:52
  • That's great, thanks! If you want to get the words with a dash or space in them, you should use: `bzcat pages-articles.xml.bz2 | grep '<title>\(.*\)</title>' | sed 's:.*<title>\(.*\)</title>.*:\1:' > words` – nico_lrx Feb 22 '22 at 18:02
10

If you are using Python, you can use WiktionaryParser by Suyash Behera.

You can install it with:

sudo pip install wiktionaryparser

Example usage:

>>> from wiktionaryparser import WiktionaryParser
>>> parser = WiktionaryParser()
>>> word = parser.fetch('test')
>>> another_word = parser.fetch('test', 'french')
>>> parser.set_default_language('french')
osolmaz
4

You could use the revisions API:

https://en.wiktionary.org/w/api.php?action=query&prop=revisions&titles=test&rvslots=*&rvprop=content&formatversion=2

Or the parse API:

https://en.wiktionary.org/w/api.php?action=parse&page=test&prop=wikitext&formatversion=2

More examples are provided in the documentation.
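
For instance, a minimal sketch in Python (assuming the `requests` library; not part of the original answer) that retrieves the wikitext through the parse API and handles missing pages:

# Sketch: fetch a page's wikitext via the parse API, using JSON output.
import requests

resp = requests.get(
    "https://en.wiktionary.org/w/api.php",
    params={
        "action": "parse",
        "page": "test",
        "prop": "wikitext",
        "format": "json",
        "formatversion": "2",
    },
)
data = resp.json()
if "error" in data:   # e.g. code "missingtitle" when the page does not exist
    print("No such page")
else:
    print(data["parse"]["wikitext"][:200])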

Peter Mortensen
builder-7000
2

You might want to try JWKTL out. I just found out about it ;)

Peter Mortensen
arek
  • The citation that you refer to is broken. Here is a link to the JWKTL page: http://www.ukp.tu-darmstadt.de/software/jwktl/. It's not really what I believe the OP is looking for, though. – djskinner Jan 14 '13 at 14:41
  • The second link is (effectively) broken. It redirects to a generic page, *[Welcome to the Ubiquitous Knowledge Processing (UKP) Lab!](https://www.informatik.tu-darmstadt.de/ukp/ukp_home/index.en.jsp)*. – Peter Mortensen Sep 12 '21 at 08:45
  • The Wikipedia reference leads to *[Extracting lexical semantic knowledge from Wikipedia and Wiktionary](http://www.lrec-conf.org/proceedings/lrec2008/pdf/420_paper.pdf)* and *"...JWKTL (Java-based WiKTionary Library)..."*. – Peter Mortensen Sep 12 '21 at 08:53
2

As mentioned earlier, the problem with this approach is that Wiktionary provides information about the words of all languages, so checking whether a page exists via the API won't work on its own: there are a lot of pages for non-English words. To overcome this, you need to parse each page to figure out whether it has a section describing the English word. Parsing wikitext isn't a trivial task, though in your case it's not that bad: to cover almost all cases, you just need to check whether the wikitext contains the English heading. Depending on the programming language you use, you can find tools that build an AST from wikitext. This will cover most cases, but not all of them, because Wiktionary includes some common misspellings.

As an alternative, you could try using Lingua Robot or something similar. Lingua Robot parses the Wiktionary content and provides it as a REST API. A non-empty response means that the word exists. Please note that, as opposed to Wiktionary, the API itself doesn't include any misspellings (at least at the time of writing this answer). Note also that Wiktionary contains not only single words but also multi-word expressions.

Peter Mortensen
Roman Kishchenko
1

Here's a start to parsing etymology and pronunciation data:

function parsePronunciationLine(line) {
  let val
  let type
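  // Note: String.prototype.replace is used here only for its side effect:
  // when a pattern matches, the callback records the captured IPA value and accent type.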
  line.replace(/\{\{\s*a\s*\|UK\s*\}\}\s*\{\{IPA\|\/?([^\/\|]+)\/?\|lang=en\}\}/, (_, $1) => {
    val = $1
    type = 'uk'
  })
  line.replace(/\{\{\s*a\s*\|US\s*\}\}\s*\{\{IPA\|\/?([^\/\|]+)\/?\|lang=en\}\}/, (_, $1) => {
    val = $1
    type = 'us'
  })
  line.replace(/\{\{enPR\|[^\}]+\}\},?\s*\{\{IPA\|\/?([^\/\|]+)\/?\|lang=en}}/, (_, $1) => {
    val = $1
    type = 'us'
  })
  line.replace(/\{\{a\|GA\}\},?\s*\{\{IPA\|\/?([^\/\|]+)\/?\|lang=en}}/, (_, $1) => {
    val = $1
    type = 'ga'
  })
  line.replace(/\{\{a\|GA\}\},?.+\{\{IPA\|\/?([^\/\|]+)\/?\|lang=en}}/, (_, $1) => {
    val = $1
    type = 'ga'
  })
  // {{a|GA}} {{IPA|/ˈhæpi/|lang=en}}
  // * {{a|RP}} {{IPA|/pliːz/|lang=en}}
  // * {{a|GA}} {{enPR|plēz}}, {{IPA|/pliz/|[pʰliz]|lang=en}}

  if (!val)
    return

  return { val, type }
}

function parseEtymologyPiece(piece) {
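  // Note: `langs` (used below) is assumed to be a lookup table of known
  // language codes, e.g. { en: true, enm: true, la: true, ... }, defined elsewhere in the full code.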
  let parts = piece.split('|')
  parts.shift() // The first one is ignored.
  let ls = []
  if (langs[parts[0]]) {
    ls.push(parts.shift())
  }
  if (langs[parts[0]]) {
    ls.push(parts.shift())
  }
  let l = ls.pop()
  let t = parts.shift()
  return [ l, t ]
  // {{inh|en|enm|poisoun}}
  // {{m|enm|poyson}}
  // {{der|en|la|pōtio|pōtio, pōtiōnis|t=drink, a draught, a poisonous draught, a potion}}
  // {{m|la|pōtō|t=I drink}}
  // {{der|en|enm|happy||fortunate, happy}}
  // {{cog|is|heppinn||lucky}}
}

Here is a gist with it more fleshed out.

Peter Mortensen
Lance