
I am looping over a very large document to lemmatise it. Unfortunately, Python does not seem to print to the file line by line, but instead runs through the whole document before printing anything, which, given the size of my file, exceeds the memory... Before I split my document into more bite-sized chunks, I wondered whether there is a way to force Python to print to the file after every line.

So far my code reads:

import spacy
nlp = spacy.load('de_core_news_lg')

fin = "input.txt"
fout = "output.txt"


#%%

with open(fin) as f:
    corpus = f.readlines()

corpus_lemma = []

for word in corpus:
    result = ' '.join([token.lemma_ for token in nlp(word)])
    corpus_lemma.append(result)

    with open(fout, 'w') as g:
        for item in corpus_lemma:
            g.write(f'{item}')

To give credit for the code, it was kindly suggested here: How to do lemmatization on German text?


1 Answer


As described in: How to read a large file - line by line?

If you do your lemmatisation inside the with block, Python will handle reading line by line using buffered I/O.

In your case, it would look like:

import spacy
nlp = spacy.load('de_core_news_lg')

fin = "input.txt"
fout = "output.txt"


#%%

corpus_lemma = []

with open(fin) as f:
    # Iterating over the file object reads one line at a time,
    # so the whole file is never loaded into memory at once.
    for line in f:
        result = " ".join(token.lemma_ for token in nlp(line))
        corpus_lemma.append(result)

# The output file must be opened in write mode ('w') for g.write() to work.
with open(fout, 'w') as g:
    for item in corpus_lemma:
        g.write(f"{item}")
