
I am trying to extract the interlanguage links from Wikipedia dumps. It seems that these links were moved to the Wikidata project, and access is now provided only via the API.

This thread explains how to deal with the issue and suggests switching to the API: Retrieving the Interlanguage links from an exported Wikipedia article?

However, the scope of my research is too large for the web API (millions of queries). Does anyone know whether it is possible to extract these links from anywhere other than the API? Parsing a dump of any size is preferable to querying the API.

Wikipedia dumps I used: http://dumps.wikimedia.org/backup-index.html

WikiData dump I used: http://dumps.wikimedia.org/wikidatawiki/latest/

Evgeny M
  • All of that info is in the Wikidata dump. Why don't you just use that? Did you have any issues with that dump? (You say that you used it, but not how, or how it failed.) – svick Jul 14 '14 at 10:20
  • So, what's the issue with the wikidata dump? Also, did you look at `enwiki-20140614-langlinks.sql.gz`? – svick Jul 14 '14 at 12:25
  • Hi, svick, thank you for the answer. Unfortunately, I did not find enough interlanguage links (ILLs) in the dumps from here: http://dumps.wikimedia.org/enwiki/20140614/ I used this one: Recombine articles, templates, media/file descriptions, and primary meta-pages. enwiki-20140614-pages-articles.xml.bz2 10.2 GB. This dump still contains some ILLs, but the majority of them were moved to WikiData. – Evgeny M Jul 14 '14 at 12:31
  • From the WikiData dumps (http://dumps.wikimedia.org/wikidatawiki/latest/) the following were used: wikidatawiki-latest-pages-articles.xml.bz2 and wikidatawiki-latest-pages-meta-current.xml.bz2. However, these pages contain information about the editors, but not the ILLs. I must have missed something. Do you perhaps know the correct dump? – Evgeny M Jul 14 '14 at 12:32
  • Not yet. But enwiki-20140614-langlinks.sql.gz - is it an SQL extraction script? I will have a look at it right now, thank you! – Evgeny M Jul 14 '14 at 12:35
  • possible duplicate of [Easy way to export Wikipedia's translated titles](http://stackoverflow.com/questions/21000834/easy-way-to-export-wikipedias-translated-titles) – svick Jul 14 '14 at 13:34
  • Thank you, this is almost exactly what I need! – Evgeny M Jul 15 '14 at 00:00
  • There are a lot of links on the Wikimedia dump download page: dumps.wikimedia.org/enwiki/latest. I need to process millions of English articles (content) and their related articles in Spanish and German. Which files should I download? – SahelSoft Jan 21 '18 at 08:28
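The `enwiki-20140614-langlinks.sql.gz` file suggested in the comments above is a MySQL dump of the `langlinks` table, which is where the interlanguage links live once they are no longer in the article wikitext: each row is `(ll_from, ll_lang, ll_title)`, with `ll_from` being the page ID of the English article and `ll_lang`/`ll_title` identifying the linked article in the other language. Below is a minimal sketch of pulling those rows out of the gzipped SQL file without importing it into MySQL; the class name and tab-separated output are arbitrary, and the regular expression assumes the standard `langlinks` column order, so verify it against the dump you actually download.

```java
import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.zip.GZIPInputStream;

public class LangLinksDumpParser {
    // One tuple of an INSERT statement: (ll_from, 'll_lang', 'll_title'),
    // allowing for backslash-escaped characters inside the quoted strings.
    private static final Pattern TUPLE =
            Pattern.compile("\\((\\d+),'((?:[^'\\\\]|\\\\.)*)','((?:[^'\\\\]|\\\\.)*)'\\)");

    public static void main(String[] args) throws Exception {
        String dumpFile = args.length > 0 ? args[0] : "enwiki-20140614-langlinks.sql.gz";
        try (BufferedReader reader = new BufferedReader(new InputStreamReader(
                new GZIPInputStream(new FileInputStream(dumpFile)), StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                if (!line.startsWith("INSERT INTO")) {
                    continue; // skip the schema definition and comment lines
                }
                Matcher m = TUPLE.matcher(line);
                while (m.find()) {
                    String pageId = m.group(1); // ll_from: page ID of the source article
                    String lang = m.group(2);   // ll_lang: target language code, e.g. "de"
                    String title = m.group(3);  // ll_title: title on the target wiki
                    System.out.println(pageId + "\t" + lang + "\t" + title);
                }
            }
        }
    }
}
```

Mapping `ll_from` back to source article titles takes one more join against the `page` table dump (`enwiki-20140614-page.sql.gz`), which can be parsed the same way.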

1 Answer


A really excellent library for easily dealing with Wikidata dumps is Wikidata Toolkit, which abstracts away a lot of details for you. The latest release, 0.3, includes a growing collection of example programs that help with basic tasks like yours. In the examples readme we find SitelinksExample.java:

This program shows how to get information about the site links that are used in Wikidata dumps. The links to Wikimedia projects use keys like "enwiki" for English Wikipedia or "hewikivoyage" for Hebrew WikiVoyage. To find out the meaning of these codes, and to create URLs for the articles on these projects, Wikidata Toolkit includes some simple functions that download and process the site links information for a given project. This example shows how to use this functionality.
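For orientation, here is a minimal sketch of what such a program can look like: an `EntityDocumentProcessor` that walks over every item in the Wikidata dump and prints the item ID together with its English and German sitelinks. The class name, the `enwiki`/`dewiki` pairing, and the tab-separated output are illustrative choices; the dump-processing entry point shown follows more recent Wikidata Toolkit releases and differs in 0.3, so treat the SitelinksExample bundled with the version you use as the authoritative reference.

```java
import org.wikidata.wdtk.datamodel.interfaces.EntityDocumentProcessor;
import org.wikidata.wdtk.datamodel.interfaces.ItemDocument;
import org.wikidata.wdtk.datamodel.interfaces.PropertyDocument;
import org.wikidata.wdtk.datamodel.interfaces.SiteLink;
import org.wikidata.wdtk.dumpfiles.DumpProcessingController;

public class InterlanguageLinkExtractor implements EntityDocumentProcessor {

    @Override
    public void processItemDocument(ItemDocument item) {
        // Site links are keyed by site id, e.g. "enwiki", "dewiki", "eswiki".
        SiteLink en = item.getSiteLinks().get("enwiki");
        SiteLink de = item.getSiteLinks().get("dewiki");
        if (en != null && de != null) {
            System.out.println(item.getItemId().getId() + "\t"
                    + en.getPageTitle() + "\t" + de.getPageTitle());
        }
    }

    @Override
    public void processPropertyDocument(PropertyDocument property) {
        // Properties carry no site links; nothing to do here.
    }

    public static void main(String[] args) {
        // Assumption: recent releases expose processMostRecentJsonDump();
        // release 0.3 used a different entry point (see its bundled examples).
        DumpProcessingController controller = new DumpProcessingController("wikidatawiki");
        controller.registerEntityDocumentProcessor(
                new InterlanguageLinkExtractor(), null, true);
        controller.processMostRecentJsonDump();
    }
}
```

Since the whole dump is streamed locally, this covers millions of items without a single API call, which matches the constraint in the question.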

notconfusing