6

Alright, I'm trying to parse the Wiktionary data dump provided by Wikimedia.

My intention is to parse that XML data dump into a MySQL database. I haven't found proper documentation on the structure of this XML, and I'm not able to open the file in an editor because it is, in fact, really huge (~1 GB).

I thought of parsing it with a PHP script, but I don't know enough about the XML structure to proceed. So if anyone has already parsed it (or knows of a tool to parse it) into MySQL using PHP, please share the details. If nothing is available in PHP, other methods are fine too.

I followed this post (http://www.igrec.ca/lexicography/installing-a-local-copy-of-wiktionary-mysql/), but it didn't work out. :( If anybody has succeeded with this process, please help. Thanks in advance.

Nemo
Jenson M John
  • Possible duplicate of [Has anyone parsed Wiktionary?](http://stackoverflow.com/questions/3364279/has-anyone-parsed-wiktionary) – Nemo Feb 13 '16 at 19:05
  • [kaikki](https://kaikki.org/) now provides an (imperfect) parse of a few Wiktionary languages – alexchandel Aug 08 '22 at 05:36

1 Answer

4

Those files can be parsed in PHP with XMLReader operating on a compress.bzip2:// stream, so the dump never has to be unpacked on disk. As an example, this is the structure of your file, sampled from roughly the first 3000 elements (a minimal import sketch follows the overview):

\-mediawiki (1)
  |-siteinfo (1)
  | |-sitename (1)
  | |-base (1)
  | |-generator (1)
  | |-case (1)
  | \-namespaces (1)
  |   \-namespace (40)
  \-page (196)
    |-title (196)
    |-ns (196)
    |-id (196)
    |-restrictions (2)
    |-revision (196)
    | |-id (196)
    | |-parentid (194)
    | |-timestamp (196)
    | |-contributor (196)
    | | |-username (182)
    | | |-id (182)
    | | \-ip (14)
    | |-comment (183)
    | |-text (195)
    | |-sha1 (195)
    | |-model (195)
    | |-format (195)
    | \-minor (99)
    \-redirect (5)
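
For example, here is a minimal import sketch along those lines. The dump file name is the usual enwiktionary pages-articles dump; the target table `wiki_page` with `title`/`wikitext` columns and the connection details are placeholders to adapt to your own schema:

```php
<?php
// Stream the compressed dump with XMLReader; only one <page> subtree is ever
// held in memory. Requires the bz2 and pdo_mysql extensions.
$pdo = new PDO(
    'mysql:host=localhost;dbname=wiktionary;charset=utf8mb4',
    'user',
    'password',
    [PDO::ATTR_ERRMODE => PDO::ERRMODE_EXCEPTION]
);
$insert = $pdo->prepare('INSERT INTO wiki_page (title, wikitext) VALUES (?, ?)');

$reader = new XMLReader();
// The bzip2 stream wrapper decompresses on the fly.
$reader->open('compress.bzip2://enwiktionary-latest-pages-articles.xml.bz2');

$count = 0;
while ($reader->read()) {
    if ($reader->nodeType !== XMLReader::ELEMENT || $reader->localName !== 'page') {
        continue;
    }

    // Materialize just this <page> element and pull out title + wikitext.
    $page  = new SimpleXMLElement($reader->readOuterXml());
    $title = (string) $page->title;
    $text  = (string) $page->revision->text;

    $insert->execute([$title, $text]);

    if (++$count % 10000 === 0) {
        echo $count, " pages imported\n";
    }
}
$reader->close();
```

Wrapping batches of inserts in a transaction speeds this up considerably, but the loop above is the gist of it.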

The file itself is a little larger, so it takes quite some time to process. Alternatively, do not operate on the XML dumps at all, but just import the SQL dumps via the mysql command-line tool. SQL dumps are available on the site as well; see the list of all dump formats for the English Wiktionary.


The overall file was a little larger, with more than 66 849 000 elements:

\-mediawiki (1)
  |-siteinfo (1)
  | |-sitename (1)
  | |-base (1)
  | |-generator (1)
  | |-case (1)
  | \-namespaces (1)
  |   \-namespace (40)
  \-page (3993913)
    |-title (3993913)
    |-ns (3993913)
    |-id (3993913)
    |-restrictions (552)
    |-revision (3993913)
    | |-id (3993913)
    | |-parentid (3572237)
    | |-timestamp (3993913)
    | |-contributor (3993913)
    | | |-username (3982087)
    | | |-id (3982087)
    | | \-ip (11824)
    | |-comment (3917241)
    | |-text (3993913)
    | |-sha1 (3993913)
    | |-model (3993913)
    | |-format (3993913)
    | \-minor (3384811)
    |-redirect (27340)
    \-DiscussionThreading (4698)
      |-ThreadSubject (4698)
      |-ThreadPage (4698)
      |-ThreadID (4698)
      |-ThreadAuthor (4698)
      |-ThreadEditStatus (4698)
      |-ThreadType (4698)
      |-ThreadSignature (4698)
      |-ThreadParent (3605)
      |-ThreadAncestor (3605)
      \-ThreadSummaryPage (11)
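
For what it's worth, an overview like the ones above can be produced by a single XMLReader pass that counts every element path. Here is a rough sketch (not necessarily the exact tool used for these listings):

```php
<?php
// Walk the whole dump once and count how often each element path occurs.
$reader = new XMLReader();
$reader->open('compress.bzip2://enwiktionary-latest-pages-articles.xml.bz2');

$counts = [];   // e.g. "mediawiki/page/revision/text" => 3993913
$path   = [];

while ($reader->read()) {
    if ($reader->nodeType === XMLReader::ELEMENT) {
        $path[] = $reader->localName;
        $key = implode('/', $path);
        $counts[$key] = ($counts[$key] ?? 0) + 1;
        if ($reader->isEmptyElement) {
            array_pop($path);       // e.g. <redirect/> emits no END_ELEMENT
        }
    } elseif ($reader->nodeType === XMLReader::END_ELEMENT) {
        array_pop($path);
    }
}
$reader->close();

foreach ($counts as $key => $n) {
    printf("%s (%d)\n", $key, $n);
}
```
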
hakre
  • Thanks @hakre, cheers. :) But I wonder which SQL file contains the words with their meanings, because some are links, site_stats, etc. Anyway, I just took enwiktionary-latest-page.sql.gz – Jenson M John Aug 11 '14 at 06:27
  • 1
    It's most likely encoded in the wikitext and not available as dedicated columns. – hakre Aug 11 '14 at 06:58