
I'm a complete novice to NLP and would like to load a compressed XML file of the Hungarian Wikipedia corpus (807 MB). I downloaded the dump file and started parsing it in Python with Gensim, but after 4 hours my laptop crashed, complaining that I had run out of RAM. I have a fairly old laptop (4 GB RAM) and was wondering whether there is any way I could solve this problem by

  • (1) either tinkering with my code, e.g., by reducing the corpus by taking, say, a 1/10th random sample of it;
  • (2) or using some cloud platform to enhance my CPU power. I read in this SO post that AWS can be used for such purposes, but I am unsure which service I should select (Amazon EC2?). I also checked Google Colab, but was confused that it lists hardware acceleration options (GPU and TPU) in the context of TensorFlow, and I am not sure whether that is suitable for NLP. I didn't find any posts about that.

Here's the Jupyter Notebook code I've tried after downloading the Wikipedia dump from here:

! pip install gensim 
from nltk.stem import SnowballStemmer
from gensim.corpora import WikiCorpus
from gensim.models.word2vec import Word2Vec

# Hungarian stemmer, intended for later preprocessing
hun_stem = SnowballStemmer(language='hungarian')

%%time
hun_wiki = WikiCorpus(r'huwiki-latest-pages-articles.xml.bz2')  # parse the compressed dump
hun_articles = list(hun_wiki.get_texts())  # materialise every tokenised article in memory
len(hun_articles)

Any guidance would be much appreciated.

babesz
  • Are you using a Jupyter Notebook? Have you tried running your program on a subset of the XML, to see if the issue is a matter of size or the design of the program? – AMC Dec 15 '19 at 22:09
  • Can you please let me know how I can run the code on a subset of the XML? Yes, I am using Jupyter Notebook, I just added that piece of info to my post as well. – babesz Dec 15 '19 at 22:35
  • That depends on the structure of the data, which I’m not familiar with. I just noticed you actually mentioned this possibility in your post, as solution (1). – AMC Dec 15 '19 at 22:52

1 Answer


807 MB compressed will likely expand to more than 4 GB uncompressed, so you're not going to have luck loading the whole dataset into memory on your machine.

But lots of NLP tasks don't require the full dataset in memory: they can stream the data from disk, repeatedly if necessary.

For example, whatever your ultimate goal is, you will often be able to just iterate over the hun_wiki.get_texts() sequence, article by article. Don't try to load it into a single in-memory list with a list() operation.
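For example, something along these lines, where do_something_with is just a hypothetical placeholder for whatever processing you actually want to do:

from gensim.corpora import WikiCorpus

hun_wiki = WikiCorpus(r'huwiki-latest-pages-articles.xml.bz2')

# stream article by article: only the current article's tokens are in RAM
for i, tokens in enumerate(hun_wiki.get_texts()):
    do_something_with(tokens)  # hypothetical placeholder for your own processing
    if i % 10000 == 0:
        print(i, 'articles processed')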

(If you really wanted to load a subset as a list, you could take the first n items from that iterator, or take a random subset via one of the ideas at an answer like this one.)
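For instance, a rough sketch of both options, with the subset size of 10,000 chosen arbitrarily:

import random
from itertools import islice

from gensim.corpora import WikiCorpus

hun_wiki = WikiCorpus(r'huwiki-latest-pages-articles.xml.bz2')

# option A: just the first 10,000 articles, streamed then materialised
first_articles = list(islice(hun_wiki.get_texts(), 10000))

# option B: a random sample of 10,000 articles via reservoir sampling;
# one pass over the stream, only ~10,000 articles held in RAM at once
sample, k = [], 10000
for i, tokens in enumerate(hun_wiki.get_texts()):
    if i < k:
        sample.append(tokens)
    else:
        j = random.randrange(i + 1)
        if j < k:
            sample[j] = tokens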

Or, you could rent a cloud machine with more memory. Almost any instance type with more memory will be suitable for running Python-based text-processing code, so just follow each service's tutorials to learn how to set up and log into a newly rented instance.

(4 GB is quite small for serious modern work, but if you're just tinkering/learning, you can work with smaller datasets and be careful not to load everything into memory when it isn't necessary.)

gojomo
  • You were right, I can iterate over the `get_texts()` sequence. However, I just realised that the `WikiCorpus()` call itself also seems to run for a very long time, and I don't quite understand which of these I should speed up if I want to maximise the number of articles in my corpus. Whatever time it takes, I guess my main question is whether small RAM can still be an issue if I rely on the solution you suggested, or whether I no longer have to worry about that? – babesz Dec 16 '19 at 23:59
  • 4GB will be limiting for this kind of work: you'll often have to limit your approaches, or datasets, or model sizes, to better fit within that addressable space. You'd be able to do more, faster, with fewer distractions, with more RAM. But it's still possible to do lots, especially as a beginner learning basic techniques, in 4GB. Whether streaming as I've suggested is enough for what you want to do depends on what you want to do next – which you haven't really detailed. – gojomo Dec 17 '19 at 03:29
  • For example, if just transforming all the texts by some rule – as with stemming – that's easy to do by streaming input from one file and streaming output to another file. On the other hand, if building a `Word2Vec` model, the model's size is largely determined by the number of words you want to retain. With less memory, you'll have to discard more of the lower-frequency words to stay within available RAM. But for lots of uses of word-vectors, discarding lower-frequency words doesn't hurt too much. So there's no definitive answer about how limiting it is, until you get into specific goals/techniques (a sketch of both patterns follows below this thread). – gojomo Dec 17 '19 at 03:31
  • I have to understand more the arguments of these functions, but thanks for the feedback! – babesz Dec 17 '19 at 20:19
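To make the comment above concrete, here is a rough sketch of both patterns; the file names, the min_count value, and the assumption that the input file holds one space-separated, tokenised article per line are all hypothetical:

from nltk.stem import SnowballStemmer
from gensim.models.word2vec import Word2Vec, LineSentence

hun_stem = SnowballStemmer(language='hungarian')

# 1) streaming transform: stem one line (one article) at a time and write it
#    straight out, so only a single article is ever held in memory
with open('huwiki_texts.txt', encoding='utf-8') as src, \
     open('huwiki_stemmed.txt', 'w', encoding='utf-8') as dst:
    for line in src:
        dst.write(' '.join(hun_stem.stem(w) for w in line.split()) + '\n')

# 2) Word2Vec model size is driven mostly by vocabulary size; a higher
#    min_count discards more low-frequency words and so saves RAM
model = Word2Vec(LineSentence('huwiki_stemmed.txt'), min_count=20)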