wiki dump encoding

Question

I'm using WikiPrep to process the latest Wikipedia dump, enwiki-20121101-pages-articles.xml.bz2. I replaced "use Parse::MediaWikiDump;" with "use MediaWiki::DumpFile::Compat;" and made the corresponding changes in the code. Then I ran

perl wikiprep.pl -f enwiki-20121101-pages-articles.xml.bz2

I got an error

enwiki-20121101-pages-articles.xml.bz2:1: parser error : Document is empty
BZh91AY&SY±H¦ÂOÿ~Ð`ÿÿÿ¿ÿÿÿ¿ÿÿÿÿÿÿÿÿÿÿ½ÿýþdß8õEnÞ¶zëJ¨Eà®mEÓP|f÷Ô
^

I guessed that the dump contains some non-UTF-8 characters, so I ran

iconv -f utf8 -t utf8 enwiki-20121101-pages-articles.xml.bz2

And indeed, I got some errors

BZh91AY&SYiconv: illegal input sequence at position 10

So my question is: what is the encoding of the wiki dump, and if I want to convert it to UTF-8, what should I do? Or how should I modify wikiprep.pl to avoid such problems?

Many thanks

-- [solved] I should have decompressed the file first.

You are running iconv on the compressed (bz2) version of the file, rather than the XML file itself. — borrible, Nov 30 '12 at 10:00
hi, borrible 2. Thanks very much. Indeed, I should unzip the file first. I got it right now. — xuan, Dec 03 '12 at 09:59

score 1 · Answer 1 · answered Apr 06 '15 at 11:46

You are running iconv on the compressed (bz2) version of the file, rather than the XML file itself. Uncompress it first.
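
The "BZh" at the start of both error messages is the bzip2 file signature, which is why neither the XML parser nor iconv could make sense of the input. As a rough sketch of checking the encoding without writing the decompressed dump to disk, something like the following might work. It assumes bzip2 and iconv are installed; the function name check_bz2_utf8 is made up for illustration and is not part of WikiPrep.

```shell
# Hypothetical helper (not part of WikiPrep): succeeds only if the
# decompressed contents of a .bz2 file are valid UTF-8.
check_bz2_utf8() {
    # Test the archive first. Without this, a file that is not bzip2 would
    # feed iconv an empty stream, which iconv accepts as valid.
    # Note: this reads the archive twice, which is slow on a full dump.
    bzip2 -t -- "$1" || return 1
    # Decompress to stdout (-d -c) and let iconv validate the byte stream.
    # The function's exit status is iconv's exit status.
    bzip2 -dc -- "$1" | iconv -f UTF-8 -t UTF-8 > /dev/null
}

# Example call, using the file name from the question:
# check_bz2_utf8 enwiki-20121101-pages-articles.xml.bz2 && echo "valid UTF-8"
```

If the check passes, no conversion is needed and the decompressed XML can be used as it is.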

(Posting borrible's answer so that this resolved question is not listed as unanswered.)

Nemo
