I'm a complete novice at NLP and would like to load a compressed XML file of the Hungarian Wikipedia corpus (807 MB). I downloaded the dump file and started parsing it in Python with Gensim, but after 4 hours my laptop crashed with an out-of-RAM error. I have a fairly old laptop (4 GB RAM) and was wondering whether there is any way I could solve this problem by
- (1) either tinkering with my code, e.g., by reducing the corpus to, say, a 1/10th random sample of it (I've sketched what I had in mind right after this list);
- (2) or using some cloud platform to get more computing power. I read in this SO post that AWS can be used for such purposes, but I am unsure which service I should select (Amazon EC2?). I also checked Google Colab, but got confused because it lists its hardware acceleration options (GPU and TPU) in the context of TensorFlow, and I am not sure whether that is suitable for NLP. I didn't find any posts about that.
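For option (1), this is roughly what I had in mind, although I'm not sure whether it would actually keep memory under control (the 10% threshold and the variable names are just my own guesses at how to do it):

import random
from gensim.corpora import WikiCorpus

hun_wiki = WikiCorpus(r'huwiki-latest-pages-articles.xml.bz2')

# get_texts() is a generator, so articles are streamed one at a time;
# keep roughly every 10th article instead of storing all of them
sampled_articles = [tokens for tokens in hun_wiki.get_texts()
                    if random.random() < 0.1]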
Here's the Jupyter Notebook code I've tried after downloading the Wikipedia dump from here:
! pip install gensim

from nltk.stem import SnowballStemmer
from gensim.corpora import WikiCorpus
from gensim.models.word2vec import Word2Vec

hun_stem = SnowballStemmer(language='hungarian')

%%time
# parse the bz2-compressed XML dump and pull every article into a list
hun_wiki = WikiCorpus(r'huwiki-latest-pages-articles.xml.bz2')
hun_articles = list(hun_wiki.get_texts())
len(hun_articles)
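In case it matters, the alternative I was considering (reusing the imports above) is to feed the articles to Word2Vec as a stream instead of a list, along these lines. The WikiSentences wrapper is just my own sketch, and I believe the dimension parameter is called vector_size in Gensim 4.x but size in 3.x:

class WikiSentences:
    """Restartable iterable, so the whole dump never has to sit in RAM."""
    def __init__(self, path):
        self.wiki = WikiCorpus(path)

    def __iter__(self):
        # re-reads the compressed dump on every pass over the data
        for tokens in self.wiki.get_texts():
            yield tokens

hun_model = Word2Vec(sentences=WikiSentences(r'huwiki-latest-pages-articles.xml.bz2'),
                     vector_size=100, workers=2)  # vector_size is the Gensim 4.x name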
Any guidance would be much appreciated.