Questions tagged [wikimedia-dumps]

48 questions
22
votes
2 answers

Multistream Wikipedia dump

I downloaded the German Wikipedia dump dewiki-20151102-pages-articles-multistream.xml. My short question is: what does 'multistream' mean in this case?
m4ri0
  • 597
  • 1
  • 6
  • 10
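A hedged sketch for the multistream question above: 'multistream' refers to the .bz2 archive being a concatenation of many independently compressed bz2 streams (blocks of roughly 100 pages each) plus a companion index file whose lines look like "offset:page_id:title", so a single page can be extracted without decompressing the whole dump. File names below are illustrative.

```python
import bz2

# Illustrative file names; the index normally ships alongside the data file.
DUMP = "dewiki-20151102-pages-articles-multistream.xml.bz2"
INDEX = "dewiki-20151102-pages-articles-multistream-index.txt"

def find_offset(title):
    """Return the byte offset of the bz2 block that contains `title`."""
    with open(INDEX, encoding="utf-8") as idx:
        for line in idx:
            offset, page_id, page_title = line.rstrip("\n").split(":", 2)
            if page_title == title:
                return int(offset)
    raise KeyError(title)

def read_block(offset):
    """Decompress one independent bz2 stream starting at `offset`."""
    with open(DUMP, "rb") as dump:
        dump.seek(offset)
        decomp = bz2.BZ2Decompressor()
        out = b""
        # Each stream holds a small batch of <page> elements; stop at its end.
        while not decomp.eof:
            chunk = dump.read(256 * 1024)
            if not chunk:
                break
            out += decomp.decompress(chunk)
        return out.decode("utf-8")

xml_fragment = read_block(find_offset("Berlin"))  # a fragment of <page> elements
```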
21
votes
2 answers

Empty list returned from ElementTree findall

I'm new to XML parsing and Python, so bear with me. I'm using lxml to parse a wiki dump, but I just want, for each page, its title and text. For now I've got this: from xml.etree import ElementTree as etree def parser(file_name): document =…
liloka
  • 1,016
  • 4
  • 14
  • 29
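A minimal sketch for the empty-findall question above: MediaWiki export XML declares a default namespace (e.g. http://www.mediawiki.org/xml/export-0.10/), so findall('page') on the root silently returns []. Matching on the local tag name avoids hard-coding the schema version; the file name is illustrative.

```python
from xml.etree import ElementTree as etree

def local(tag):
    """Strip the '{namespace}' prefix from an ElementTree tag."""
    return tag.rsplit("}", 1)[-1]

def iter_pages(file_name):
    for _event, elem in etree.iterparse(file_name, events=("end",)):
        if local(elem.tag) == "page":
            title, text = "", ""
            for child in elem.iter():
                if local(child.tag) == "title":
                    title = child.text or ""
                elif local(child.tag) == "text":
                    text = child.text or ""
            yield title, text
            elem.clear()  # keep memory flat on multi-gigabyte dumps

for title, text in iter_pages("dewiki-pages-articles.xml"):
    print(title, len(text))
    break
```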
19
votes
9 answers

Parsing a Wikipedia dump

For example, using this Wikipedia dump: http://en.wikipedia.org/w/api.php?action=query&prop=revisions&titles=lebron%20james&rvprop=content&redirects=true&format=xmlfm Is there an existing library for Python that I can use to create an array with the…
tomwu
  • 397
  • 1
  • 3
  • 11
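A hedged sketch for the question above: the URL shown is actually an action API query rather than a dump file, so the page wikitext can be fetched as JSON with standard API parameters (turning the wikitext into structured data afterwards is what wikitext parsers such as mwparserfromhell are commonly used for). Parameter values here are illustrative.

```python
import requests

API = "https://en.wikipedia.org/w/api.php"
params = {
    "action": "query",
    "prop": "revisions",
    "titles": "LeBron James",
    "rvprop": "content",
    "rvslots": "main",
    "redirects": 1,
    "format": "json",
    "formatversion": 2,
}
data = requests.get(API, params=params, timeout=30).json()
page = data["query"]["pages"][0]
wikitext = page["revisions"][0]["slots"]["main"]["content"]
print(wikitext[:200])  # raw wikitext of the requested article
```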
6
votes
1 answer

Is there any way to get wikipedia pageview statistics per page at the *country* grain (instead of simply language)?

I see dumps.wikimedia.org/other/pagecounts-raw/, for example, but no country-specific data there...
6
votes
0 answers

Getting Wikidata incremental triples

I would like to know if it is possible to get the latest incremental n-triple dumps of Wikidata. I'm using Wikidata Toolkit to download the latest version of the dumps and convert them automatically into n-triple files (using…
Ortzi
  • 363
  • 1
  • 6
  • 17
3
votes
0 answers

How to get the cutoff timestamp or lastrevid for a given Wikidata JSON dump?

I am using Wikidata enriched with other data sources, and I must ingest the entire Wikidata JSON dump into a dev graph database of mine. That's easy, and once that's done, I want to keep my copy updated by querying the RecentChanges and LogEvents API…
Lazhar
  • 1,401
  • 16
  • 37
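A hedged sketch for the cutoff question above, assuming the usual layout of the Wikidata JSON dump (a JSON array with one entity per line, each entity carrying a "lastrevid" and, when present, a "modified" timestamp): scanning those fields gives an approximate cutoff to feed into the RecentChanges/LogEvents queries. The file name is illustrative, and a full scan is slow; sampling the first lines may already be good enough.

```python
import bz2
import json

max_revid, max_modified = 0, ""
with bz2.open("wikidata-all.json.bz2", "rt", encoding="utf-8") as dump:
    for line in dump:
        line = line.rstrip().rstrip(",")   # entities are separated by commas
        if not line or line in ("[", "]"):
            continue
        entity = json.loads(line)
        max_revid = max(max_revid, entity.get("lastrevid", 0))
        max_modified = max(max_modified, entity.get("modified", ""))

# Use max_revid / max_modified as the starting point for incremental updates.
print(max_revid, max_modified)
```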
3
votes
2 answers

Wiktionary in Structured Format

How do I acquire a Wiktionary, say for English, in a structured format, typically RDF? The recommended website http://downloads.dbpedia.org/wiktionary/ is dead. And I don't understand if there are some existing frameworks that extract an…
Nordlöw
  • 11,838
  • 10
  • 52
  • 99
3
votes
2 answers

Extracting Wikimedia pageview statistics

Wikipedia provides all their page views in an hourly text file. (See for instance http://dumps.wikimedia.org/other/pagecounts-raw/2014/2014-01/) For a project I need to extract keywords and their associated page views for the year 2014. But seeing…
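A minimal sketch for the pageview question above, assuming the documented pagecounts-raw line format "project page_title view_count bytes_transferred" (e.g. "en Main_Page 42 1234567"); the file name and keyword list are illustrative.

```python
import gzip
from collections import Counter
from urllib.parse import unquote

keywords = {"Berlin", "Hamburg"}   # illustrative keyword list
totals = Counter()

with gzip.open("pagecounts-20140101-000000.gz", "rt",
               encoding="utf-8", errors="replace") as f:
    for line in f:
        parts = line.split(" ")
        if len(parts) != 4:
            continue
        project, title, views, _size = parts
        if project != "en":                       # restrict to English Wikipedia
            continue
        title = unquote(title).replace("_", " ")  # titles are URL-encoded
        if title in keywords:
            totals[title] += int(views)

print(totals)
```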
3
votes
2 answers

R XML: How to retrieve a node with a given value

Here's a snippet of the XML file I am using: AccessibleComputing 0 10 381202555 381200179
2
votes
1 answer

Understanding wikimedia dumps

I'm trying to parse the latest Wikisource dump. More specifically, I would like to get all the pages under the Category:Ballads page. For this purpose, I downloaded the…
Gilad
  • 538
  • 5
  • 16
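A hedged sketch for the category question above: in the XML dump, category membership only appears inside each page's wikitext (the structured links live in the separate categorylinks SQL dump), so the simplest route is often the categorymembers list of the action API, paginated with the standard "continue" mechanism. The endpoint and category come from the question.

```python
import requests

API = "https://en.wikisource.org/w/api.php"
params = {
    "action": "query",
    "list": "categorymembers",
    "cmtitle": "Category:Ballads",
    "cmlimit": "max",
    "format": "json",
}

titles = []
while True:
    data = requests.get(API, params=params, timeout=30).json()
    titles += [m["title"] for m in data["query"]["categorymembers"]]
    if "continue" not in data:
        break
    params.update(data["continue"])  # carry cmcontinue into the next request

print(len(titles), titles[:5])
```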
2
votes
1 answer

Select rows based on information stored in separate table

First of all, I'm sorry for the overly vague title; I'm unfamiliar with the proper terminology for a problem like this. I'm attempting to retrieve a list of page titles from Wiktionary (Wikimedia's wiki-based dictionary) where the page must be…
Prime
  • 2,410
  • 1
  • 20
  • 35
2
votes
1 answer

Use a wikimedia image on my website

So I have a Wikimedia Commons URL (which is really just a wrapper page for the actual image), like this: https://commons.wikimedia.org/wiki/File:Nine_inch_nails_-_Staples_Center_-11-8-13(10755555065_16053de956_o).jpg If I go to that page, I can see that…
dessalines
  • 6,352
  • 5
  • 42
  • 59
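A hedged sketch for the Commons question above: the File: page is only an HTML wrapper, and the direct upload.wikimedia.org URL can be resolved through the imageinfo API (iiprop=url); Special:FilePath/<file name> is a simpler alternative that just redirects to the same URL. The file title is taken from the question.

```python
import requests

API = "https://commons.wikimedia.org/w/api.php"
file_title = "File:Nine_inch_nails_-_Staples_Center_-11-8-13(10755555065_16053de956_o).jpg"
params = {
    "action": "query",
    "titles": file_title,
    "prop": "imageinfo",
    "iiprop": "url",
    "format": "json",
    "formatversion": 2,
}
data = requests.get(API, params=params, timeout=30).json()
direct_url = data["query"]["pages"][0]["imageinfo"][0]["url"]
print(direct_url)  # points at upload.wikimedia.org, usable in an <img> tag
```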
2
votes
3 answers

How to find old wikipedia dumps

I need to access very old Wikipedia dumps (backups of Wikipedia) in French. I succeeded in finding a 2010 backup on archive.org, and now I'm searching for 2006 or even earlier. I know that the latest dumps contain all the data from previous…
Léo Joubert
  • 522
  • 4
  • 17
2
votes
1 answer

Parse XML dump of a MediaWiki wiki

I am trying to parse an XML dump of Wiktionary, but I am probably missing something since I don't get any output. This is a similar but much shorter XML file:
CptNemo
  • 6,455
  • 16
  • 58
  • 107
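A common reason for "no output" with a hand-rolled parser is the export file's default XML namespace (see the ElementTree sketch earlier on this page). Another hedged option is the mwxml package, assuming its documented Dump.from_file iteration API; the file name is illustrative.

```python
import mwxml  # assumes `pip install mwxml`

# Sketch: iterate pages and revisions of a MediaWiki XML export.
with open("enwiktionary-pages-articles.xml") as f:
    dump = mwxml.Dump.from_file(f)
    for page in dump:
        for revision in page:
            print(page.title, len(revision.text or ""))
        break  # remove to process the whole dump
```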
2
votes
1 answer

wiki dump encoding

I'm using WikiPrep to process the latest wiki dump enwiki-20121101-pages-articles.xml.bz2. I replaced "use Parse::MediaWikiDump;" with "use MediaWiki::DumpFile::Compat;" and made the corresponding changes in the code. Then, I ran perl…
xuan
  • 270
  • 1
  • 2
  • 15