
I'm going to parse Wiktionary dump files in several languages (English, Japanese, etc.). From another question (Parse Wiktionary XML data dump into MySQL database using PHP) I can see the basic structure of the file. But what do these elements stand for?

For example, I think the title under the page element is a word in the vocabulary. But where is its translation in other languages? Where are its synonyms?


1 Answer


"...translation in other languages? Where are its synonyms?"

There are three pieces of bad news for you.

  1. All this information (translations, synonyms) is stored as plain text inside the Wiktionary article.

  2. Different Wiktionaries structure their dictionary articles differently. For example, compare the structure of an article in the English Wiktionary with the structure of an article in the Russian Wiktionary.

  3. The structure of a Wiktionary article is not represented in the XML file; the article is just plain text (see item 1). Thus you need to parse this text in order to extract synonyms or translations.
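
To make the three points concrete, here is a minimal Python sketch of the two steps involved: reading page titles and raw article text out of the XML, then parsing that text for synonyms and translations. Everything in it is an assumption for illustration: the tiny SAMPLE_DUMP, the helper names (iter_pages, section_body, extract_entry), and the patterns, which only cover one English Wiktionary convention (level-4 "Synonyms"/"Translations" headings with {{l|...}} and {{t|...}}/{{t+|...}} templates). Real articles vary a lot, and other language editions need different rules.

```python
import io
import re
import xml.etree.ElementTree as ET

# Hypothetical miniature dump for illustration only; real dumps are far
# larger and their wikitext is far less regular.
SAMPLE_DUMP = """<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.10/">
  <page>
    <title>cat</title>
    <revision>
      <text>==English==
===Noun===
# A domesticated feline.

====Synonyms====
* {{l|en|feline}}
* {{l|en|moggy}}

====Translations====
* French: {{t+|fr|chat|m}}
* German: {{t+|de|Katze|f}}
</text>
    </revision>
  </page>
</mediawiki>"""

# Assumed template shapes: {{l|lang|word}} and {{t|lang|word}} / {{t+|lang|word}}.
LINK_RE = re.compile(r"\{\{l\|[^|}]+\|([^|}]+)")
TRANS_RE = re.compile(r"\{\{t\+?\|([^|}]+)\|([^|}]+)")
HEADING_RE = re.compile(r"^=+[^=\n]+=+[ \t]*$", re.MULTILINE)


def local_name(tag):
    """Strip the '{namespace}' prefix that ElementTree puts on tag names."""
    return tag.rsplit("}", 1)[-1]


def iter_pages(stream):
    """Yield (title, wikitext) pairs, streaming so large dumps fit in memory."""
    title, text = None, ""
    for _, elem in ET.iterparse(stream, events=("end",)):
        name = local_name(elem.tag)
        if name == "title":
            title = elem.text
        elif name == "text":
            text = elem.text or ""
        elif name == "page":
            yield title, text
            title, text = None, ""
            elem.clear()  # free the finished page's subtree


def section_body(wikitext, heading):
    """Return the text between a heading and the next heading of any level."""
    start = re.search(
        r"^=+[ \t]*" + re.escape(heading) + r"[ \t]*=+[ \t]*$",
        wikitext,
        re.MULTILINE,
    )
    if start is None:
        return ""
    rest = wikitext[start.end():]
    nxt = HEADING_RE.search(rest)
    return rest[: nxt.start()] if nxt else rest


def extract_entry(wikitext):
    """Pull synonyms and translations out of one article's plain text."""
    synonyms = LINK_RE.findall(section_body(wikitext, "Synonyms"))
    translations = {}
    for lang, word in TRANS_RE.findall(section_body(wikitext, "Translations")):
        translations.setdefault(lang, []).append(word)
    return {"synonyms": synonyms, "translations": translations}


if __name__ == "__main__":
    for title, text in iter_pages(io.BytesIO(SAMPLE_DUMP.encode("utf-8"))):
        print(title, extract_entry(text))
```

The XML layer only gives you the title and one big text field per page; all of the dictionary structure has to be recovered by the text-parsing half, which is the part that must be rewritten for each language edition.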

You are welcome to read my paper about transforming (parsing) the texts of Wiktionary articles into a machine-readable database: http://arxiv.org/abs/1011.1368

  • Nice! Hope it helps! Now I just read each line into Python and extract information. But it seems that the exceptions are everywhere and it's hard to use a rule to extract them. Hmm... –  Sep 12 '15 at 17:42