
I am trying to count how often specific words occur in a given report. Does anyone know why defining the list within the code makes the second part of the following code run faster than reading the list from a file? Is there a solution? The actual list is a lot longer than the two words in the following example.

# Example code: Within code list
import csv
import glob
import re
import time

TARGET_FILES = r'C:/Users/s170760/Desktop/Reports_Cleaned/*.*'

OUTPUT_FILE = r'C:/Users/s170760/Desktop/Parser.csv'

OUTPUT_FIELDS = ['file name', 'create']

create = {'agile', 'skills'}

def main():

    f_out = open(OUTPUT_FILE, 'w')
    wr = csv.writer(f_out, lineterminator='\n')
    wr.writerow(OUTPUT_FIELDS)

    file_list = glob.glob(TARGET_FILES)
    for file in file_list:
        print(file)
        with open(file, 'r', encoding='UTF-8', errors='ignore') as f_in:
            doc = f_in.read()
        doc = doc.lower()
        output_data = get_data(doc)
        output_data[0] = file
        wr.writerow(output_data)

def get_data(doc):
    _odata = [0] * 2
    
    tokens = re.findall(r'\w(?:[-\w]*\w)?', doc)  # raw string avoids invalid-escape warnings
    for token in tokens:
        if token in create:
            _odata[1] += 1
    return _odata

Here is the other way:

# Example code: Reading list from a file
import csv
import glob
import re
import time

TARGET_FILES = r'C:/Users/s170760/Desktop/Reports_Cleaned/*.*'

OUTPUT_FILE = r'C:/Users/s170760/Desktop/Parser.csv'

OUTPUT_FIELDS = ['file name', 'create']

create = open('C:/Users/s170760/Desktop/Create.txt', 'r').read().splitlines()

def main():

    f_out = open(OUTPUT_FILE, 'w')
    wr = csv.writer(f_out, lineterminator='\n')
    wr.writerow(OUTPUT_FIELDS)

    file_list = glob.glob(TARGET_FILES)
    for file in file_list:
        print(file)
        with open(file, 'r', encoding='UTF-8', errors='ignore') as f_in:
            doc = f_in.read()
        doc = doc.lower()
        output_data = get_data(doc)
        output_data[0] = file
        wr.writerow(output_data)

def get_data(doc):
    _odata = [0] * 2
    
    tokens = re.findall(r'\w(?:[-\w]*\w)?', doc)  # raw string avoids invalid-escape warnings
    for token in tokens:
        if token in create:
            _odata[1] += 1
    return _odata
Mansoor
    Your first example makes a set of strings (`{'agile', 'skills'}`), the second example is a list of strings. Testing if something is in a set (`if token in create`) is fast, testing if something is in a long list requires looping through the list. That won't be noticeable in a short list, but could be in a long one. – Mark Oct 14 '21 at 19:54

    What you defined in the first snippet is a set, not a list. Opening a file to read from is definitely going to be slower than defining a 2 member set. Also, can you fix the missing single quotes in the snippets so we can get proper code highlighting? – Alex Oct 14 '21 at 19:55

1 Answer

As pointed out by Mark in the comments, the first code snippet uses a set of strings, while the second code snippet loads a file into a list of strings.

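Why sets are faster than lists in this use case is well explained in this Stack Overflow answer. In short, a membership test on a set is a hash lookup that takes roughly constant time on average, whereas a membership test on a list has to compare the token against the elements one by one. Converting the lines read from the file to a set can indeed solve your problem.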
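
To see the difference for yourself, here is a minimal timing sketch. The 10,000 generated words are a hypothetical stand-in for a long Create.txt, and the absolute numbers will vary by machine; only the gap between the two matters.

```python
import timeit

# Hypothetical word list standing in for a long Create.txt (assumption).
words_list = [f"word{i}" for i in range(10_000)]
words_set = set(words_list)

# A token that is not present is the worst case for the list:
# every element has to be compared before the test returns False.
token = "not-in-the-list"

list_time = timeit.timeit(lambda: token in words_list, number=1_000)
set_time = timeit.timeit(lambda: token in words_set, number=1_000)

print(f"list lookup: {list_time:.4f} s")
print(f"set lookup:  {set_time:.4f} s")
```

Both tests return the same answer; only the cost per lookup differs, and that cost is paid once per token in every report.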

So replace:

create = open('C:/Users/s170760/Desktop/Create.txt', 'r').read().splitlines()

With:

create = set(open('C:/Users/s170760/Desktop/Create.txt', 'r').read().splitlines())
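
If you want something a little more defensive, the sketch below closes the file explicitly and normalises the words before building the set. The helper names `load_words` and `count_matches` are mine, not from your code, and it assumes Create.txt holds one word per line. Lowercasing the words is also an assumption, made because your script lowercases each document before tokenising it.

```python
import re


def load_words(path):
    """Read one word per line into a set, skipping blank lines."""
    with open(path, 'r', encoding='UTF-8') as f:
        # Strip whitespace and lowercase so entries match the lowercased tokens.
        return {line.strip().lower() for line in f if line.strip()}


def count_matches(doc, words):
    """Count the tokens of doc that appear in the set words."""
    tokens = re.findall(r'\w(?:[-\w]*\w)?', doc.lower())
    return sum(1 for token in tokens if token in words)
```

You would then build the set once at start-up, for example `create = load_words('C:/Users/s170760/Desktop/Create.txt')`, and reuse it for every report.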
Wouter