
I have a Python script that parses email addresses out of large documents. It uses all of the RAM on my machine and locks it up to the point where I have to restart. Is there a way to limit its memory use, or maybe even have it pause after it finishes reading one file and has produced some output? Any help would be great, thank you.

#!/usr/bin/env python

# Extracts email addresses from one or more plain text files.
#
# Notes:
# - Does not save to file (pipe the output to a file if you want it saved).
# - Does not check for duplicates (which can easily be done in the terminal).
# Twitter @Critical24 - DefensiveThinking.io 


from optparse import OptionParser
import os
import re

regex = re.compile(r"([a-z0-9!#$%&'*+\/=?^_`{|}~-]+(?:\.[a-z0-9!#$%&'*+\/=?^_`"
                   r"{|}~-]+)*(@|\sat\s)(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?(\.|"
                   r"\sdot\s))+[a-z0-9](?:[a-z0-9-]*[a-z0-9])?)")

def file_to_str(filename):
    """Returns the contents of filename as a string."""
    with open(filename, encoding='utf-8') as f:  # Added encoding='utf-8'
        return f.read().lower()  # Case is lowered to prevent regex mismatches.

def get_emails(s):
    """Returns an iterator of matched emails found in string s."""
    # Removing lines that start with '//' because the regular expression
    # mistakenly matches patterns like 'http://foo@bar.com' as '//foo@bar.com'.
    return (email[0] for email in re.findall(regex, s) if not email[0].startswith('//'))

not_parseable_files = ['.txt', '.csv']
for root, dirs, files in os.walk('.'):  # Recursively search all subdirectories for files.
    for file in files:
        _, file_ext = os.path.splitext(file)  # Get the file's extension.
        file_path = os.path.join(root, file)
        if file_ext in not_parseable_files:  # Skip extensions in the banned list.
            print("File %s is not parseable" % file_path)
            continue  # Continue the loop with the next file.
        if os.path.isfile(file_path):
            for email in get_emails(file_to_str(file_path)):
                print(email)
  • How large are those files? Unless your pattern can span multiple lines, you could try reading the files line-by-line and applying it to each line, i.e. use file `f` as a generator instead of using `read` or `readlines`. – tobias_k Sep 14 '18 at 15:07
  • Also, I just noticed that your comment says that the script extracts from "plain text files", but `.txt` is in your list of _non_-parseable files. Should that be a list of _parseable_ files instead? – tobias_k Sep 14 '18 at 15:10
  • Probably. I am still learning Python. The file sizes range from 800 KB to 8 GB. – Alex Sep 14 '18 at 15:12
  • I found your problem: `"([a-z0-9!#$%&'*+\/=?^_`{|}~-]+(?:\.[a-z0-9!#$%&'*+\/=?^_`" "{|}~-]+)*(@|\sat\s)(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?(\.|" "\sdot\s))+[a-z0-9](?:[a-z0-9-]*[a-z0-9])?)"` – Nathan McCoy Sep 14 '18 at 15:13
  • Some of the indentation in the code in the question is broken. – Bryan Oakley Sep 14 '18 at 15:20
  • In a worst case, if you have an 8 gig file and read it into memory, you're using 8 gigs of memory (plus a bit of overhead). If you then try to parse that and return the parsed data, that could easily result in _another_ 8 gigs of memory being used. – Bryan Oakley Sep 14 '18 at 15:23
  • The kids today... they load whole files into memory like RAM was unlimited, and then they wonder why their programs freeze the computer. When I started programming, one of the first things you learned was that RAM was a very limited and precious resource and that you should NEVER ever try to read a whole file at once. – bruno desthuilliers Sep 14 '18 at 15:32
  • Thanks for the fix up above. The class I am taking hasn't even talked about memory yet. I am hoping to finish this class, but it isn't what I thought. – Alex Sep 14 '18 at 19:43

1 Answer


It seems like you are reading files of up to 8 GB into memory, using f.read(). Instead, you could try applying the regex to each line of the file, without ever having the entire file in memory:

with open(filename, encoding='utf-8') as f:  # Added encoding='utf-8'
    # Build the list while the file is still open; only the matched
    # addresses are kept in memory, never the whole file.
    return [email[0] for line in f
                     for email in re.findall(regex, line.lower())
                     if not email[0].startswith('//')]

This can still take a very long time, though. Also, I did not check your regex for possible problems.

tobias_k
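
For reference, here is a minimal sketch (not part of the original question or answer) of how the line-by-line idea could be wired into the question's directory walk using a generator function, so a file stays open only while its matches are being yielded and is never read whole. The helper name emails_in_file and the errors='ignore' option are illustrative assumptions, not something from the original script:

import os
import re

# Same pattern as in the question, compiled once.
regex = re.compile(r"([a-z0-9!#$%&'*+\/=?^_`{|}~-]+(?:\.[a-z0-9!#$%&'*+\/=?^_`"
                   r"{|}~-]+)*(@|\sat\s)(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?(\.|"
                   r"\sdot\s))+[a-z0-9](?:[a-z0-9-]*[a-z0-9])?)")

def emails_in_file(file_path):
    """Yield matched addresses one at a time, reading the file line by line."""
    # errors='ignore' (an assumption) skips bytes that are not valid UTF-8.
    with open(file_path, encoding='utf-8', errors='ignore') as f:
        for line in f:
            for match in re.findall(regex, line.lower()):
                if not match[0].startswith('//'):
                    yield match[0]

for root, dirs, files in os.walk('.'):
    for name in files:
        file_path = os.path.join(root, name)
        if os.path.isfile(file_path):
            for email in emails_in_file(file_path):
                print(email)

The output can still be piped to a file and de-duplicated in the shell, as the notes at the top of the original script suggest.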