
I can't find any online resources that show how to parse a Wiktionary API response that looks like this:

{
    "query": {
        "pages": {
            "40915": {
                "pageid": 40915,
                "ns": 0,
                "title": "reluctant",
                "revisions": [
                    {
                        "contentformat": "text/x-wiki",
                        "contentmodel": "wikitext",
                        "*": "==English==\n\n===Etymology===\nFrom {{etyl|la|en}} {{term|lang=la|reluctans}}, present participle of {{term|reluctare}}, {{term|reluctari||to struggle against, oppose, resist}}, from {{term|re-||back}} + {{term|luctari||to struggle}}.\n\n===Pronunciation===\n* {{IPA|/ɹɪˈlʌktənt/}}\n* {{audio|en-us-reluctant.ogg|Audio (US)}}\n\n===Adjective===\n{{en-adj}}\n\n# {{context|now|_|rare|lang=en}} [[opposing|Opposing]]; offering [[resistance]] (to).\n#* '''1819''', Lord Byron, ''Don Juan'', II.108:\n#*: There, breathless, with his digging nails he clung / Fast to the sand, lest the returning wave, / From whose '''reluctant''' roar his life he wrung, / Should suck him back to her insatiate grave [...].\n#* '''2008''', Kern Alexander et al., ''The World Trade Organization and Trade in Services'', p. 222:\n#*: They are '''reluctant''' to the inclusion of a necessity test, especially of a horizontal nature, and emphasize, instead, the importance of procedural disciplines [...].\n# Not [[wanting]] to take some [[action]]; [[unwilling]].\n#: ''She was '''reluctant''' to lend him the money''\n\n====Synonyms====\n* [[unwilling]], [[disinclined]]\n\n====Translations====\n{{trans-top|not wanting to take some action}}\n* Chinese: \n*: Mandarin: {{t|cmn|不情願|sc=Hani}}, {{t+|cmn|不情愿|tr=bùqíngyuàn|sc=Hani}}\n* Czech: {{t|cs|neochotný}}, {{t|cs|zdráhající}} se\n* Dutch: {{t+|nl|aarzelend}}\n* Finnish: {{t+|fi|haluton}}, {{t+|fi|vastahakoinen}}\n* French: {{t+|fr|réservé}},  {{t+|fr|réfractaire}},  {{t+|fr|rétif}}\n* German: {{t|de|zögernd}}\n* Hungarian: {{t|hu|kelletlen}}\n* Indonesian: {{t+|id|enggan}}\n* Interlingua: [[reluctante]]\n* Italian: {{t+|it|riluttante}}\n{{trans-mid}}\n* Latin: {{t|la|invītus}}\n* Manx: {{t|gv|neuarryltagh}}, {{t|gv|neuwooiagh}}\n* Maori: {{t|mi|whakawhēuaua}}, {{t|mi|manauhea}}\n* Polish: [[niechętny]]\n* Romanian: reticent, precaut, {{t|ro|prevăzător}}\n* Russian: {{t+|ru|неохотный|tr=neoxótnyj}}\n* Scots: {{t|sco|sweer}}, 
{{t|sco|sweirt}}, {{t|sco|laith}}\n* Scottish Gaelic: {{t|gd|aindeònach}}, {{t|gd|leisg}}\n* Spanish: {{t+|es|renuente}}, {{t|es|reacio}}\n* Swedish: {{t|sv|motvillig}}\n{{trans-bottom}}\n\n====Related terms====\n* [[reluctance]]\n* [[reluctantly]]\n\n===External links===\n* {{R:Webster 1913}}\n* {{R:Century 1911}}\n* {{R:OneLook}}\n\n[[ca:reluctant]]\n[[cy:reluctant]]\n[[et:reluctant]]\n[[el:reluctant]]\n[[es:reluctant]]\n[[fr:reluctant]]\n[[ko:reluctant]]\n[[io:reluctant]]\n[[kn:reluctant]]\n[[ku:reluctant]]\n[[hu:reluctant]]\n[[mg:reluctant]]\n[[ml:reluctant]]\n[[my:reluctant]]\n[[nl:reluctant]]\n[[pl:reluctant]]\n[[pt:reluctant]]\n[[simple:reluctant]]\n[[fi:reluctant]]\n[[sv:reluctant]]\n[[ta:reluctant]]\n[[te:reluctant]]\n[[th:reluctant]]\n[[vi:reluctant]]\n[[zh:reluctant]]"
                    }
                ]
            }
        }
    }
}

All I want is the English definition, but the response format is odd: everything about the word is jumbled into one large, seemingly inseparable blob of wikitext.

  1. Is there a way to make the API return structured JSON, where the English definition is simply the value of a JSON key?
  2. Would I have to resort to a regex pattern to do this, and if so, what might that look like?
  3. Lastly, why would the API designers return data like this? I want to judge and say they have no idea what they're doing, but surely there must be a reason.
Snowman
  • The obvious answer to why the API doesn't break down the page into definitions is that it's a generic MediaWiki API, not a Wiktionary API, and doesn't know anything about the structure of the page (which is just a set of conventions followed by Wiktionary contributors, not a formally specified, machine-parseable standard). –  Dec 02 '13 at 20:46
  • I'm not affiliated with Wiktionary (though we have parsed their data in our project), so I can only assume the reason for the structure is that they use a normal MediaWiki installation as the foundation, which does not provide a "dictionary style" structure. In our project we parsed the database dump using a combination of ``String#indexOf``, ``#substring``, etc. and a bunch of regular expressions. Terrible code and a maintenance nightmare. – qqilihq Dec 02 '13 at 20:47
  • http://www.mediawiki.org/wiki/Alternative_parsers looks like a good place to start parsing the wikitext. The final step of deciding how the wiki syntax tree maps onto dictionary definitions will be up to you though. –  Dec 02 '13 at 20:54
  • [Wiktionary-l](https://lists.wikimedia.org/pipermail/wiktionary-l/) has the experience of many people on how they did this. – Nemo Jul 25 '15 at 17:33
  • Possible duplicate of [Has anyone parsed Wiktionary?](http://stackoverflow.com/questions/3364279/has-anyone-parsed-wiktionary) – Nemo Feb 13 '16 at 19:04

1 Answer


Use the extracts property (prop=extracts) to get an HTML version of the page instead of raw wikitext:

https://en.wiktionary.org/w/api.php?titles=cloud&action=query&prop=extracts&format=json
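To make this concrete, here is a minimal Python sketch of reading either kind of response. It assumes the extracts response has the same `query` → `pages` → `<page id>` nesting as the JSON in the question, with the HTML under an `extract` key. The helper names (`first_page`, `fetch_extract`, `english_definitions`, `strip_markup`) and the User-Agent string are made up for illustration. The regex part is only a rough heuristic based on the convention visible in the question's wikitext, where definition lines start with `#` and quotations or usage examples start with `#*` or `#:`; it is not a real wikitext parser.

```python
import json
import re
import urllib.parse
import urllib.request

API_URL = "https://en.wiktionary.org/w/api.php"


def first_page(response):
    """Return the first page object; pages are keyed by page id."""
    return next(iter(response["query"]["pages"].values()))


def fetch_extract(title):
    """Fetch the HTML extract for a title (needs network access)."""
    params = urllib.parse.urlencode(
        {"action": "query", "prop": "extracts", "titles": title, "format": "json"}
    )
    # Placeholder User-Agent; replace it with something identifying your tool.
    request = urllib.request.Request(
        f"{API_URL}?{params}",
        headers={"User-Agent": "wiktionary-example/0.1 (placeholder)"},
    )
    with urllib.request.urlopen(request) as resp:
        return first_page(json.load(resp)).get("extract", "")


def strip_markup(text):
    """Crudely reduce wikitext to plain text."""
    text = re.sub(r"\{\{[^{}]*\}\}", "", text)  # drop non-nested templates (labels are lost)
    text = re.sub(r"\[\[(?:[^\]|]*\|)?([^\]]*)\]\]", r"\1", text)  # [[a|b]] -> b, [[a]] -> a
    text = re.sub(r"'{2,}", "", text)  # bold/italic quote runs
    return " ".join(text.split())


def english_definitions(wikitext):
    """Collect '#' definition lines that fall under the ==English== heading."""
    definitions = []
    in_english = False
    for line in wikitext.splitlines():
        heading = re.fullmatch(r"==([^=]+)==\s*", line)  # level-2 heading = language
        if heading:
            in_english = heading.group(1).strip() == "English"
            continue
        # Skip '#*' (quotations), '#:' (examples) and '##' (sub-senses).
        if in_english and re.match(r"#[^*:#]", line):
            definitions.append(strip_markup(line[1:]))
    return definitions
```

For the wikitext response from the question, something like `english_definitions(first_page(data)["revisions"][0]["*"])` would return the two definition lines as plain strings, where `data` is the decoded JSON. Dropping templates also drops labels such as the `{{context|...}}` one, so treat this as a starting point rather than a robust solution.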

Raj
neuronet