
I have written the following script to calculate statistics about the textual corpora I analyze from a linguistics angle. However, the text files I analyze are rather big for this kind of processing (~3 GB, ~500M words), which is probably why my script fails given my current hardware (i5, 16 GB RAM). The 'MemoryError' appears when I launch the script from the Terminal, so I must admit I am unsure whether it is a Python or a Bash error message, although I reckon the implications are the same either way; correct me if I'm wrong.

I am not a computer scientist, so it is very likely that the tools I use are not the best suited or most efficient for the task. Would anyone have any recommendations to improve the script and make it able to handle such volumes of data? Please keep in mind that my tech/programming knowledge is rather limited, being a linguist first and foremost, so if you could explain the technical parts with that in mind, that would be awesome.

Thanks a lot in advance!

EDIT: here is the error message I get, as requested in the comments:

"Traceback (most recent call last): File "/path/to/my/myscript.py", line 43, in keywords, target_norm, reference_norm, smp_score = calculate_keywords('file1.txt', 'file2.txt') File "/path/to/my/myscript.py", line 9, in calculate_keywords target_text = f.read().lower().translate(str.maketrans('','','?!"():;.,“/[]')) MemoryError

#!/usr/bin/env python3

import collections
import math
import string

def calculate_keywords(target, reference):
    with open(target, 'r') as f:
        target_text = f.read().lower().translate(str.maketrans('','','?!"():;.,“/[]'))
        target_words = target_text.split()

    with open(reference, 'r') as f:
        reference_text = f.read().lower().translate(str.maketrans('','','?!"():;.,“/[]'))
        reference_words = reference_text.split()

    target_freq = collections.Counter(target_words)
    reference_freq = collections.Counter(reference_words)

    target_total = sum(target_freq.values())
    reference_total = sum(reference_freq.values())
    
    target_norm = {}
    reference_norm = {}

    for word, freq in target_freq.items():
        target_norm[word] = freq / target_total * 1000000

    for word, freq in reference_freq.items():
        reference_norm[word] = freq / reference_total * 1000000

    smp_scores = {}
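    # score each target word against the reference corpus:
    # (target freq per million + 100) / (reference freq per million + 100)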
    for word, freq in target_norm.items():
        if word not in reference_norm:
            reference_norm[word] = 0
        s1 = freq + 100
        s2 = reference_norm[word] + 100
        smp_scores[word] = s1 / s2

    keywords = sorted(smp_scores, key=smp_scores.get, reverse=True)[:50]
    return keywords, target_norm, reference_norm, smp_scores
    

keywords, target_norm, reference_norm, smp_score = calculate_keywords('myfile1.txt', 'myfile2.txt')
for word in keywords:
    print(f"{word} {target_norm[word]} {reference_norm[word]} {smp_score[word]}")
Michael Gauthier
  • There should be a stack trace if the memory error is coming from Python. What do you actually see in the terminal when you run this? – DarkKnight Jun 13 '23 at 11:26
  • Thanks for your feedback @DarkKnight, I have just edited my post to add the exact error message. – Michael Gauthier Jun 13 '23 at 12:14
  • OK - So that's definitely coming from Python (not bash). Have you tried the revised code I provided? – DarkKnight Jun 13 '23 at 12:23
  • Thanks for the quick feedback once more, as well as for your explanations. I have just tried your revised code, but unfortunately I run into the same error message ~3 mins after I launch it. I will try again with all my other applications (e.g. Firefox) closed, to see if the extra memory makes a difference... – Michael Gauthier Jun 13 '23 at 12:28
  • @DarkKnight Ok, so closing all the other applications does not seem to help much more... : / Are there other solutions you can think of, ideally that would be accessible to non-tech people like me? – Michael Gauthier Jun 13 '23 at 12:39
  • Are both files of a similar size? – DarkKnight Jun 13 '23 at 14:52
  • @DarkKnight no, in that case one is 3.2 GB and the other is 919 MB. But in other contexts the files may be more similar in size. – Michael Gauthier Jun 13 '23 at 14:55
  • Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/254062/discussion-between-darkknight-and-michael-gauthier). – DarkKnight Jun 13 '23 at 15:00

2 Answers


You might be able to reduce memory consumption by deleting target_words and reference_words after you've built the Counter objects. These lists remain in scope and therefore cannot be garbage collected until calculate_keywords() has returned.
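For illustration, the del version of the first few lines would look something like this (just a sketch of that first option):

def calculate_keywords(target, reference):
    with open(target, 'r') as f:
        target_words = f.read().lower().translate(str.maketrans('','','?!"():;.,“/[]')).split()
    target_freq = collections.Counter(target_words)
    del target_words  # drop the huge word list as soon as the counts exist
    # ... same for the reference file, then continue as before

You could also achieve this without explicit use of del by writing discrete functions to handle some of the processing, so that the intermediate objects go out of scope when each function returns: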

import collections

def get_counter(filename):
    with open(filename) as f:
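        # NB: f.read() still loads the whole file into one string at once;
        # this version only shortens how long the intermediate list lives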
        words = f.read().lower().translate(str.maketrans('','','?!"():;.,“/[]')).split()
        return collections.Counter(words)

def get_norm(filename):
    c = get_counter(filename)
    total = sum(c.values())
    return {word: freq / total * 1_000_000 for word, freq in c.items()}

def calculate_keywords(target, reference):
    target_norm = get_norm(target)
    reference_norm = get_norm(reference)

    smp_scores = {}

    for word, freq in target_norm.items():
        if word not in reference_norm:
            reference_norm[word] = 0
        s1 = freq + 100
        s2 = reference_norm[word] + 100
        smp_scores[word] = s1 / s2

    keywords = sorted(smp_scores, key=smp_scores.get, reverse=True)[:50]
    return keywords, target_norm, reference_norm, smp_scores
    

keywords, target_norm, reference_norm, smp_score = calculate_keywords('myfile1.txt', 'myfile2.txt')
for word in keywords:
    print(f"{word} {target_norm[word]} {reference_norm[word]} {smp_score[word]}")

This will potentially improve matters, as the memory used in get_counter() and get_norm() goes out of scope when each function returns and can therefore be released (garbage collected). Note, however, that get_counter() still calls f.read(), so each file is still loaded into memory in one piece; peak usage is lower, but it may not be low enough for a 3 GB file.
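If you want to check how much the peak actually drops, the standard-library tracemalloc module can report it (a quick diagnostic sketch, not part of the fix; note that tracing adds noticeable overhead on files this large):

import tracemalloc

tracemalloc.start()
keywords, target_norm, reference_norm, smp_score = calculate_keywords('myfile1.txt', 'myfile2.txt')
current, peak = tracemalloc.get_traced_memory()  # bytes currently allocated / peak so far
print(f"peak memory: {peak / 1024**2:.1f} MiB")
tracemalloc.stop()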

DarkKnight

Here is a working solution that is also fast: instead of reading each file into one giant string, it feeds the Counter line by line, so memory use is bounded by the number of distinct words rather than by the size of the file:

#!/usr/bin/env python3

import collections

# build the punctuation-stripping table once, rather than on every line
PUNCTUATION_TABLE = str.maketrans('', '', '?!"():;.,“/[]')

def words_in_line(line):
    return line.lower().translate(PUNCTUATION_TABLE).split()

def get_counter(filename):
    counter = collections.Counter()
    with open(filename) as file:
        # iterate line by line so only one line is in memory at a time
        for line in file:
            counter.update(words_in_line(line))
    return counter

def get_norm(filename):
    c = get_counter(filename)
    total = sum(c.values())
    # normalise raw counts to frequencies per million words
    return {word: freq / total * 1_000_000 for word, freq in c.items()}

def calculate_keywords(target, reference):
    target_norm = get_norm(target)
    reference_norm = get_norm(reference)

    smp_scores = {}

    for word, freq in target_norm.items():
        if word not in reference_norm:
            reference_norm[word] = 0
        s1 = freq + 100
        s2 = reference_norm[word] + 100
        smp_scores[word] = s1 / s2

    keywords = sorted(smp_scores, key=smp_scores.get, reverse=True)[:50]
    return keywords, target_norm, reference_norm, smp_scores
    

keywords, target_norm, reference_norm, smp_score = calculate_keywords('myfile1.txt', 'myfile2.txt')
for word in keywords:
    print(f"{word} {target_norm[word]} {reference_norm[word]} {smp_score[word]}")
Michael Gauthier