I have a problem importing a big XML file (1.3 GB) into MongoDB in order to find the most frequent words in a map-reduce manner.
http://dumps.wikimedia.org/plwiki/20141228/plwiki-20141228-pages-articles-multistream.xml.bz2
Here is an excerpt (the first 10,000 lines) cut out of that big file:
http://www.filedropper.com/text2
I know that I can't import XML directly into MongoDB. I have tried several tools and some Python scripts to do so, and all of them failed.
Which tool or script should I use? What should the key and value be? I think the best document shape for finding the most frequent word would be:
(_id : id, value: word )
Then I would sum all the elements, as in the example from the docs:
http://docs.mongodb.org/manual/core/map-reduce/
Any clues would be greatly appreciated, but the core question is: how do I import this file into MongoDB so that I get a collection of documents like this?
(_id : id, value: word )
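To check what that summing step should produce before involving the database, here is a minimal local sketch of the same aggregation. The function name `top_words` is my own; the commented map/reduce pair mirrors the word-count example from the MongoDB docs and assumes documents shaped like `{"value": <word>}` (untested here, since it needs a running server):

```python
from collections import Counter

def top_words(words, n=10):
    """Compute locally what the MongoDB map-reduce would produce:
    (word, occurrence count) pairs, most frequent first."""
    return Counter(words).most_common(n)

# The equivalent server-side pair (sketch only):
#   map:    function () { emit(this.value, 1); }
#   reduce: function (key, values) { return Array.sum(values); }
```

Running the local version on a sample of the extracted words is a cheap way to sanity-check the map-reduce output later.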
If you have any ideas, please share.
Edited: After some research, I would use Python or JS to complete this task. I would extract only the words inside the <text></text> sections, which sit under <page><revision>, exclude the <, > and other markup characters, then split the text into individual words and upload them to MongoDB with PyMongo or a JS script. There are several pages in the dump, each with a revision and a text section.
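A streaming parse avoids loading the 1.3 GB file into memory. This is a sketch of the extraction step using `xml.etree.ElementTree.iterparse` from the standard library; the upload step at the bottom is commented out and assumes PyMongo plus a local `mongod`, with the database/collection names (`wiki.words`) being my placeholders:

```python
import re
from xml.etree import ElementTree

WORD_RE = re.compile(r"\w+", re.UNICODE)

def extract_words(xml_path):
    """Yield lower-cased words from every <text> element in the dump.

    iterparse streams the file, so it is never held in memory at once.
    MediaWiki dumps namespace-qualify their tags, so only the local part
    of each tag name is compared.
    """
    for _event, elem in ElementTree.iterparse(xml_path):
        if elem.tag.rsplit("}", 1)[-1] == "text" and elem.text:
            for word in WORD_RE.findall(elem.text.lower()):
                yield word
        elem.clear()  # release the finished subtree to keep memory flat

# Upload sketch (untested; requires pymongo and a running mongod):
# from pymongo import MongoClient
# coll = MongoClient().wiki.words
# coll.insert_many({"value": w} for w in extract_words("plwiki.xml"))
```

Splitting on `\w+` also drops the `<`, `>` and punctuation characters mentioned above; `iterparse` can likewise read a file object, so the `.bz2` dump could be fed in via `bz2.open` without decompressing it to disk first.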