
I want to count entities/categories in a wiki dump of a particular language, say English. The official documentation is very tough to find and follow for a beginner. What I have understood so far is that I can download an XML dump (which of the available files should I download?) and parse it to count entities (the article topics) and categories.

This information, if it is available at all, is very difficult to find. Please help with some instructions on how to work with the dumps, or point me to resources where I can learn about them.

Thanks!

  • Did you try searching for the dumps, downloading a recent one, and opening it with the 'less' command from a bash terminal? – Debasis Jul 23 '20 at 08:07
  • See: https://stackoverflow.com/q/30387731/6276743. It helped a lot. What exactly are you trying to do, in terms of counting categories and stuff? –  Jul 26 '20 at 21:58
  • Somewhat related: https://stackoverflow.com/questions/63934708/how-do-i-prepare-to-use-entire-wikipedia-for-natural-language-processing. Download the .zim file, then scrape the pages like regular web scraping (or rely on DBpedia). – amirouche Sep 20 '20 at 08:49
  • There are also HDT dumps, which might be easier to use: https://www.rdfhdt.org/what-is-hdt/ – amirouche Oct 29 '20 at 15:14

2 Answers


The exact instructions differ a lot depending on your use case. You can either download the dumps from https://dumps.wikimedia.org/enwiki/ and parse them locally, or you can query the API instead.
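
If all you need are overall counts, the API route can be very short. As a rough sketch (not part of the original answer), the Action API's siteinfo module reports site-wide statistics, including the number of content pages; the snippet below uses the third-party requests library and prints a couple of those figures:

```python
# Minimal sketch: ask the MediaWiki Action API for site-wide statistics
# instead of parsing any dump.
import requests

resp = requests.get(
    "https://en.wikipedia.org/w/api.php",
    params={
        "action": "query",
        "meta": "siteinfo",
        "siprop": "statistics",
        "format": "json",
    },
    headers={"User-Agent": "wiki-count-example/0.1 (demo script)"},  # identify yourself politely
    timeout=30,
)
stats = resp.json()["query"]["statistics"]
print("content pages (articles):", stats["articles"])
print("all pages (incl. talk, categories, redirects):", stats["pages"])
```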

If you want to parse the dumps, https://jamesthorne.com/blog/processing-wikipedia-in-a-couple-of-hours/ is a good article that shows how one could do that.
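
To give a flavour of what parsing the dump involves (a rough sketch under my own assumptions, not code from that article): each <page> element in a pages-articles dump carries an <ns> element with a namespace number, where 0 means an ordinary article and 14 means a category, so counting both boils down to streaming the XML. The filename below is hypothetical; substitute whichever dump file you downloaded.

```python
# Sketch: stream a pages-articles dump and count pages per namespace.
import bz2
import xml.etree.ElementTree as ET
from collections import Counter

DUMP = "enwiki-latest-pages-articles.xml.bz2"  # hypothetical local filename

counts = Counter()
with bz2.open(DUMP, "rb") as f:
    for _, elem in ET.iterparse(f, events=("end",)):
        # Tags in the dump are namespace-qualified, e.g.
        # "{http://www.mediawiki.org/xml/export-0.10/}page", so match on the suffix.
        if elem.tag.endswith("}page"):
            for child in elem:
                if child.tag.endswith("}ns"):
                    counts[child.text] += 1
                    break
            elem.clear()  # the full dump is tens of GB; free each page as we go

print("articles (namespace 0):", counts.get("0", 0))
print("categories (namespace 14):", counts.get("14", 0))
```

Keep in mind that namespace 0 also contains redirects; if you only want "real" articles, additionally skip pages that contain a <redirect> element. A run over the full English dump will take a while.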

However, parsing the dumps isn't always the best solution. If you want to know the three largest pages, for instance, you could use https://en.wikipedia.org/wiki/Special:LongPages.

In addition to all of this, you can also use https://quarry.wmcloud.org to query the live replica of Wikipedia's database. An example can be found at https://quarry.wmcloud.org/query/38441.

– Martin Urbanec

The dumps are rather unwieldy: even the small "truthy" Wikidata dump is about 25 GB, and because RDF is rather verbose, that expands to more than 100 GB. So my generic advice is to avoid the dumps.

If you can't avoid them, https://wdumps.toolforge.org/dumps allows you to create customised subsets of the dumps with just the languages/properties/entities you want.

Then just read the result line by line and do something with each line; a rough sketch of what that might look like is below.
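
For example, if the subset you download comes as N-Triples (one `<subject> <predicate> <object> .` statement per line), a minimal sketch for counting the distinct Wikidata items that occur as subjects could look like this (the filename is made up):

```python
# Sketch: count distinct Q-items appearing as subjects in an N-Triples file.
import gzip

DUMP = "wdumps-subset.nt.gz"  # hypothetical filename for a wdumps download
ENTITY_PREFIX = "<http://www.wikidata.org/entity/Q"

entities = set()
with gzip.open(DUMP, "rt", encoding="utf-8") as f:
    for line in f:
        subject = line.split(" ", 1)[0]
        if subject.startswith(ENTITY_PREFIX):
            entities.add(subject)

print("distinct Q-entities:", len(entities))
```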

– Matthias Winkelmann