
I'm trying to parse the latest Wikisource dump. More specifically, I would like to get all the pages in Category:Ballads. For this purpose I downloaded the https://dumps.wikimedia.org/enwikisource/latest/enwikisource-latest-pages-articles.xml.bz2 dump. In this dump, the category page contains everything except the links to its member pages:

<page>
    <title>Category:Ballads</title>
    <ns>14</ns>
    <id>115796</id>
    <revision>
      <id>4753508</id>
      <parentid>4003780</parentid>
      <timestamp>2014-01-25T16:21:08Z</timestamp>
      <contributor>
        <username>EmausBot</username>
        <id>983607</id>
      </contributor>
      <minor />
      <comment>Bot: Migrating 2 interwiki links, now provided by [[Wikipedia:Wikidata|Wikidata]] on [[d:Q8286819]]</comment>
      <model>wikitext</model>
      <format>text/x-wiki</format>
      <text bytes="51" xml:space="preserve">[[Category:Song lyrics]]
[[Category:Poems by form]]</text>
      <sha1>43eusqpjj6kaqcp6nl1tcmo4ass36ia</sha1>
    </revision>
  </page>

My question is, how do I get the actual page content and all the links in this page?

Thank you!

Gilad

1 Answer


You downloaded the wrong dump for this purpose. If you're interested in category membership (the categorylinks table), you need to download https://dumps.wikimedia.org/enwikisource/latest/enwikisource-latest-categorylinks.sql.gz, for instance.

If you want to stay with the XML format, you need to extract this information yourself from the raw wikitext. For that, you can use https://dumps.wikimedia.org/enwikisource/latest/enwikisource-latest-pages-meta-current.xml.bz2.

EDIT per comments:

enwikisource-latest-pages-meta-current.xml doesn't contain machine-readable information about categories; it only contains the current content of each page. You need to look for the <text> XML element, which holds the raw wikitext stored in the page. Usually, at the end of the content, it has something like this:

[[Category:American Civil War]]
[[category:American speeches]]

This indicates that the page is in the categories "American Civil War" and "American speeches".
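
The extraction described above can be sketched in Python roughly as follows. This is a minimal sketch, not a complete solution: the names `categories_in`, `local_name` and `pages_in_category` are made up for illustration, the regular expression only recognises literal `[[Category:...]]` links, and categories added indirectly (for example by a template) will not be found this way.

```python
import bz2
import re
import xml.etree.ElementTree as ET

# Matches [[Category:Name]] and [[Category:Name|sort key]].
# The "Category" prefix is matched case-insensitively, as in the example above.
CATEGORY_LINK = re.compile(
    r"\[\[\s*category\s*:\s*([^\]|]+?)\s*(?:\|[^\]]*)?\]\]", re.IGNORECASE
)


def categories_in(wikitext):
    """Return the category names linked literally from raw wikitext."""
    return [m.group(1).replace("_", " ") for m in CATEGORY_LINK.finditer(wikitext)]


def local_name(tag):
    """Drop the XML namespace prefix that the export format puts on tags."""
    return tag.rsplit("}", 1)[-1]


def pages_in_category(stream, wanted):
    """Yield titles of pages whose wikitext links to the category `wanted`."""
    title = None
    for _event, elem in ET.iterparse(stream, events=("end",)):
        name = local_name(elem.tag)
        if name == "title":
            title = elem.text
        elif name == "text":
            if wanted in categories_in(elem.text or ""):
                yield title
        elif name == "page":
            elem.clear()  # release the finished page; dumps are large


# Usage (the file name is assumed to be the downloaded dump):
# with bz2.open("enwikisource-latest-pages-meta-current.xml.bz2", "rb") as f:
#     for title in pages_in_category(f, "Ballads"):
#         print(title)
```

The comparison is an exact string match on the category name, so variations in capitalisation of the name itself would need extra normalisation.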

If you want already-parsed information, you need to work with the .sql file, as far as I know.
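
Working with the .sql file does not necessarily require a running MySQL server, since the dump is a text file of `INSERT` statements that can be read directly. The following is only a rough sketch under stated assumptions: the function names are hypothetical, and it assumes that the first field of each row is the page ID (`cl_from`) and the second is the category name with underscores instead of spaces (`cl_to`). Check the `CREATE TABLE` statement at the top of your dump, because the column layout may differ between versions.

```python
import gzip


def parse_rows(line):
    """Yield tuples of raw field strings from one INSERT ... VALUES line."""
    marker = " VALUES "
    pos = line.find(marker)
    if not line.startswith("INSERT INTO") or pos < 0:
        return
    row, field = [], []
    in_row = in_string = False
    i, n = pos + len(marker), len(line)
    while i < n:
        ch = line[i]
        if in_string:
            if ch == "\\" and i + 1 < n:
                # Keep the escaped character as-is. This handles \' and \\;
                # other escapes such as \n are not decoded in this sketch.
                i += 1
                field.append(line[i])
            elif ch == "'":
                in_string = False
            else:
                field.append(ch)
        elif ch == "'":
            in_string = True
        elif ch == "(" and not in_row:
            in_row, row, field = True, [], []
        elif ch == "," and in_row:
            row.append("".join(field))
            field = []
        elif ch == ")" and in_row:
            row.append("".join(field))
            yield tuple(row)
            in_row = False
        elif in_row:
            field.append(ch)
        i += 1


def page_ids_in_category(lines, category):
    """Yield page IDs of rows whose category field equals `category`.

    ASSUMPTION: field 0 is cl_from (page ID) and field 1 is cl_to
    (category name with underscores). Verify against the dump's schema.
    """
    wanted = category.replace(" ", "_")
    for line in lines:
        for row in parse_rows(line):
            if len(row) > 1 and row[1] == wanted:
                yield int(row[0])


# Usage (the file name is assumed to be the downloaded dump):
# with gzip.open("enwikisource-latest-categorylinks.sql.gz", "rt",
#                encoding="utf-8", errors="replace") as f:
#     print(list(page_ids_in_category(f, "Ballads")))
```

Note that this yields page IDs rather than titles; the IDs correspond to the `<id>` element of each `<page>` in the XML dump, so the two files can be joined on that value.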

Martin Urbanec
  • Thank you for the answer. I tried the meta XML as well and got the same result. I also tried to import the SQL dump into a SQLite database and got syntax errors. I prefer not to run MySQL for this task. – Gilad Sep 30 '20 at 21:25
  • `enwikisource-latest-pages-meta-current.xml` doesn't have information about categories themselves, but it does contain article content; you would need to parse the article content to get the data about categories. I'll update the answer to explain that. – Martin Urbanec Oct 01 '20 at 11:44
  • Regarding importing into SQLite: that fails because the MySQL/MariaDB SQL dialect differs from the one SQLite uses. I edited the answer to explain what you can do with the meta XML file. – Martin Urbanec Oct 01 '20 at 11:48