So, in Python, I'm using markovify to build Markov models of large text corpora and generate random sentences from them. I'm also using nltk to make the Markov models obey sentence structure. Since generating a Markov model from a large corpus takes quite a while, especially with nltk's part-of-speech tagger, rebuilding the same model on every run is wasteful, so I decided to save the Markov models as JSON files and reuse them later.
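For context, the models are saved with markovify's own JSON round trip (to_json when building, from_json when loading). The sketch below is a minimal reconstruction rather than my exact script: the POSifiedText subclass follows the part-of-speech pattern along the lines of the one in the markovify README, and the corpus and output filenames are made up.

import nltk
import markovify

# Part-of-speech-tagging subclass: each word is stored together with its
# POS tag so that generated sentences follow plausible sentence structure.
class POSifiedText(markovify.Text):
    def word_split(self, sentence):
        words = nltk.word_tokenize(sentence)
        words = ["::".join(tag) for tag in nltk.pos_tag(words)]
        return words

    def word_join(self, words):
        sentence = " ".join(word.split("::")[0] for word in words)
        return sentence

# Build the model once (the slow part), then persist it for later runs.
with open('corpus.txt') as f:         # illustrative filename
    model = POSifiedText(f.read())
with open('model.json', 'w') as out:  # illustrative filename
    out.write(model.to_json())

However, when I try to read several of these large JSON files back in Python, I'm having some issues. The following is the code: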
import nltk
import markovify
import os
import json

pathfiles = 'C:/Users/MF/Documents/NetBeansProjects/Newyckham/data/'
filenames = []
ebook = []

def build_it(path):
    # Collect the path of every .json model file under the data directory.
    for root, dirs, files in os.walk(path):
        for file in files:
            if file.endswith(".json"):
                filenames.append(os.path.join(root, file))
    # Load each saved model back into a markovify Text object.
    for file in filenames:
        print(str(file))
        with open(file) as myjson:
            # This is the line that raises the MemoryError below.
            ebook.append(markovify.Text.from_json(json.load(myjson)))
    return ebook

text_model = markovify.combine(build_it(pathfiles))
for i in range(5):
    print(text_model.make_sentence())
    print('\r\n')

print(text_model.make_short_sentence(140))
print('\r\n')
But I get the following error:
Traceback (most recent call last):
  File "C:\Users\MF\Desktop\eclipse\markovify-master\terceiro.py", line 24, in <module>
    text_model = markovify.combine(build_it(pathfiles))
  File "C:\Users\MF\Desktop\eclipse\markovify-master\terceiro.py", line 21, in build_it
    ebook.append(markovify.Text.from_json(json.load(myjson)))
  File "C:\Python27\lib\json\__init__.py", line 290, in load
    **kw)
  File "C:\Python27\lib\json\__init__.py", line 338, in loads
    return _default_decoder.decode(s)
  File "C:\Python27\lib\json\decoder.py", line 366, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
  File "C:\Python27\lib\json\decoder.py", line 382, in raw_decode
    obj, end = self.scan_once(s, idx)
MemoryError
I have read some similar questions on this site about how to deal with this issue, and most of them point toward using ijson and skipping the parts of the JSON file you don't need (roughly the pattern sketched below). However, there's nothing within these JSON files that I can skip, so any ideas on this?
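For reference, the ijson approach those answers describe looks roughly like this; the filename is illustrative, and this only shows the general streaming pattern, not code from my project:

import ijson

# Stream the document event by event instead of json.load-ing it all at
# once, keeping only the parts you need. 'model.json' is illustrative.
with open('model.json', 'rb') as f:
    for prefix, event, value in ijson.parse(f):
        # In my case every event belongs to the Markov chain itself,
        # so there is nothing here that can be filtered out.
        pass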