22

I downloaded the German Wikipedia dump dewiki-20151102-pages-articles-multistream.xml. My short question is: what does 'multistream' mean in this case?

hippietrail
m4ri0

2 Answers

26

The dumps are compressed with bz2. bz2 supports a parallel mode that compresses the data as several independent streams, which lets files be compressed and decompressed faster. Data compressed this way is tagged as multistream.

This matters when you process the dump from a programming language, since you may have to pass a flag telling the library how to decompress it (parallel or non-parallel).
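To make the difference concrete, here is a minimal sketch using Python's standard bz2 module. The two tiny `<page>` payloads are made-up stand-ins for real dump content. It shows that a decoder expecting a single stream stops after the first one, while a multi-stream-aware call reads everything.

```python
import bz2

# Two independently compressed streams, concatenated.
# This is a tiny made-up stand-in for a multistream dump.
part_a = bz2.compress(b"<page>A</page>\n")
part_b = bz2.compress(b"<page>B</page>\n")
multistream = part_a + part_b

# A single BZ2Decompressor handles exactly one stream and then stops.
single = bz2.BZ2Decompressor()
first_only = single.decompress(multistream)
leftover = single.unused_data  # raw bytes of the second, still compressed stream

# bz2.decompress (Python 3.3+) keeps going through every concatenated stream.
everything = bz2.decompress(multistream)
```

If a library silently returns only the first part of the dump, single-stream decoding of a multistream file is a likely cause.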

David Przybilla
  • Could you please answer this question: https://stackoverflow.com/questions/48386791/extract-related-articles-in-different-languages-using-wikidata-toolkit?noredirect=1#comment84061677_48386791 – SahelSoft Feb 04 '18 at 15:12
5

Multistream allows the use of an index to decompress individual sections as needed, without having to decompress the entire file.

This allows a reader to pull articles out of a compressed dump.
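A rough sketch of that kind of random access in Python follows. It assumes you already know the byte offset at which a stream starts; in the real dumps this would come from the accompanying index file. The helper names `build_demo_dump` and `read_stream_at` and the sample pages are hypothetical and exist only for illustration.

```python
import bz2
import io

def build_demo_dump(chunks):
    """Compress each chunk as its own bz2 stream.

    Returns the concatenated bytes and the byte offset of each stream.
    """
    buf = io.BytesIO()
    offsets = []
    for chunk in chunks:
        offsets.append(buf.tell())
        buf.write(bz2.compress(chunk))
    return buf.getvalue(), offsets

def read_stream_at(fileobj, offset, chunk_size=64 * 1024):
    """Decompress exactly one bz2 stream that starts at byte `offset`."""
    fileobj.seek(offset)
    decomp = bz2.BZ2Decompressor()
    out = []
    while not decomp.eof:
        data = fileobj.read(chunk_size)
        if not data:
            break  # file ended before the stream was complete
        out.append(decomp.decompress(data))
    return b"".join(out)

# Demo with made-up pages; a real reader would open the .bz2 dump in binary mode.
dump, offsets = build_demo_dump([b"<page>Berlin</page>", b"<page>Hamburg</page>"])
second = read_stream_at(io.BytesIO(dump), offsets[1])
```

Only the bytes from the chosen offset to the end of that one stream are decompressed, so the cost depends on the size of the stream rather than the size of the whole dump.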

RobC