I have written the following script to calculate statistics about the textual corpora I analyze from a linguistics angle. However, the text files I work with are relatively big for this kind of processing (~3 GB, ~500M words), which is probably why my script fails on my current hardware (i5, 16 GB RAM). The 'MemoryError' appears when I launch the script from the Terminal, so I must admit I am unsure whether it is a Python or a Bash error message, although I reckon the implications are the same; correct me if I'm wrong.
I am not a computer scientist, so it is very likely that the tools I use are not the most suitable or efficient for the task. Would anyone have recommendations for improving the script so it can handle such volumes of data? Please keep in mind that my tech/programming knowledge is fairly limited, as I am a linguist before anything else, so explanations pitched at that level would be awesome.
Thanks a lot in advance!
EDIT: here is the error message I get, as requested by some of you:
"Traceback (most recent call last): File "/path/to/my/myscript.py", line 43, in keywords, target_norm, reference_norm, smp_score = calculate_keywords('file1.txt', 'file2.txt') File "/path/to/my/myscript.py", line 9, in calculate_keywords target_text = f.read().lower().translate(str.maketrans('','','?!"():;.,“/[]')) MemoryError
#!/usr/bin/env python3
import collections

def calculate_keywords(target, reference):
    # Read each file in full, lowercase it, and strip punctuation before splitting.
    with open(target, 'r') as f:
        target_text = f.read().lower().translate(str.maketrans('', '', '?!"():;.,“/[]'))
    target_words = target_text.split()
    with open(reference, 'r') as f:
        reference_text = f.read().lower().translate(str.maketrans('', '', '?!"():;.,“/[]'))
    reference_words = reference_text.split()

    # Raw frequency counts and corpus sizes.
    target_freq = collections.Counter(target_words)
    reference_freq = collections.Counter(reference_words)
    target_total = sum(target_freq.values())
    reference_total = sum(reference_freq.values())

    # Normalised frequencies, in occurrences per million words.
    target_norm = {}
    reference_norm = {}
    for word, freq in target_freq.items():
        target_norm[word] = freq / target_total * 1000000
    for word, freq in reference_freq.items():
        reference_norm[word] = freq / reference_total * 1000000

    # Simple maths keyness score, smoothed by adding 100 to both frequencies.
    smp_scores = {}
    for word, freq in target_norm.items():
        if word not in reference_norm:
            reference_norm[word] = 0
        s1 = freq + 100
        s2 = reference_norm[word] + 100
        smp_scores[word] = s1 / s2

    # The 50 words with the highest keyness scores.
    keywords = sorted(smp_scores, key=smp_scores.get, reverse=True)[:50]
    return keywords, target_norm, reference_norm, smp_scores

keywords, target_norm, reference_norm, smp_score = calculate_keywords('myfile1.txt', 'myfile2.txt')
for word in keywords:
    print(f"{word} {target_norm[word]} {reference_norm[word]} {smp_score[word]}")
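For what it's worth, here is a minimal sketch of the kind of "streaming" rewrite I have read about, where the Counter is fed one line at a time so the whole 3 GB file never has to be loaded with f.read(). I have not been able to test it on the full corpora yet, and it assumes the files actually contain line breaks (a corpus stored as one single long line would defeat it):

#!/usr/bin/env python3
# Sketch only: count words line by line instead of reading the whole file,
# so memory use depends on vocabulary size rather than file size.
import collections

PUNCT_TABLE = str.maketrans('', '', '?!"():;.,“/[]')

def count_words(path):
    freq = collections.Counter()
    with open(path, 'r') as f:
        for line in f:  # one line at a time, never the full file
            freq.update(line.lower().translate(PUNCT_TABLE).split())
    return freq

def calculate_keywords(target, reference):
    target_freq = count_words(target)
    reference_freq = count_words(reference)
    target_total = sum(target_freq.values())
    reference_total = sum(reference_freq.values())
    # From here on, everything operates on one entry per distinct word,
    # which is far smaller than the raw text.
    target_norm = {w: f / target_total * 1000000 for w, f in target_freq.items()}
    reference_norm = {w: f / reference_total * 1000000 for w, f in reference_freq.items()}
    smp_scores = {}
    for word, freq in target_norm.items():
        reference_norm.setdefault(word, 0)  # words absent from the reference count as 0
        smp_scores[word] = (freq + 100) / (reference_norm[word] + 100)
    keywords = sorted(smp_scores, key=smp_scores.get, reverse=True)[:50]
    return keywords, target_norm, reference_norm, smp_scores

As far as I can tell this should produce the same counts as my original script, since split() treats newlines like any other whitespace, but corrections are welcome if I am missing a pitfall.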