
I am new to Python and naively wrote a Python script for the following task:

I want to create a bag-of-words representation of multiple objects. Each object is basically a pair <movie name, movie synopsis>, and the bag-of-words representation is to be made from the synopsis. So each object is converted to <movie name, bag-of-words vector of the synopsis> in the final output.

Here is the script:

import re
import math
import itertools
from nltk.corpus import stopwords
from nltk import PorterStemmer
from collections import defaultdict
from collections import Counter
from itertools import dropwhile

import sys, getopt

inp = "inp_6000.txt"  #input file name
out = "bowfilter10"   #output file name
with open(inp,'r') as plot_data:
    main_dict = Counter()
    file1, file2 = itertools.tee(plot_data, 2)
    line_one = itertools.islice(file1, 0, None, 4)
    line_two = itertools.islice(file2, 2, None, 4)
    dictionary = defaultdict(Counter)
    doc_count = defaultdict(Counter)
    for movie_name, movie_plot in itertools.izip(line_one, line_two):
        movie_plot = movie_plot.lower()
        words = re.findall(r'\w+', movie_plot, flags = re.UNICODE | re.LOCALE)  #split words
        elemStopW = filter(lambda x: x not in stopwords.words('english'), words)   #remove stop words, python nltk
        for word in elemStopW:
            word = PorterStemmer().stem_word(word)   #use python stemmer class to do stemming
            #increment the word count of the movie in the particular movie synopsis
            dictionary[movie_name][word] += 1
            #increment the count of a particular word in the main dictionary, which stores frequencies over all documents
            main_dict[word] += 1
            #This is done to calculate term frequency-inverse document frequency. It takes note of the first occurrence of the word in the synopsis and neglects all others.
            if doc_count[word]['this_mov']==0:
                doc_count[word].update(count=1, this_mov=1);
        for word in doc_count:
            doc_count[word].update(this_mov=-1)
    #print "---------main_dict---------"
    #print main_dict
    #Remove all the words with frequency less than 5 in whole set of movies
    for key, count in dropwhile(lambda key_count: key_count[1] >= 5, main_dict.most_common()):
        del main_dict[key]
    #print main_dict
    #Write to file
    bow_vec = open(out, 'w');
    #calculate the bow vector and write it
    m = len(dictionary)
    for movie_name in dictionary.keys():
        #print movie_name
        vector = []
        for word in list(main_dict):
            #print word, dictionary[movie_name][word]
            x = dictionary[movie_name][word] * math.log(m/doc_count[word]['count'], 2)
            vector.append(x)
        #write to file
        bow_vec.write("%s" % movie_name)
        for item in vector:
            bow_vec.write("%s," % item)
        bow_vec.write("\n")

Format of the data file and additional information about data: The data file has the following format:

<Movie Name>
<Empty line>
<Movie synopsis> (one can assume the size to be around 150 words)
<Empty line>

Note: the <...> are meant for representation only.
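For example (titles and synopses are made up), each record occupies four lines, which is why the script reads every fourth line:

The Matrix

A computer hacker learns that the world he lives in is a simulation and joins a rebellion against its controllers.

Inception

A thief who extracts secrets from people's dreams is hired to plant an idea in a target's mind instead.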

Size of the input file:
The file size is around 200 MB.

As of now this script takes around 10-12 hours on a 3 GHz Intel processor.

Note: I am looking for improvements to the serial code. I know parallelization would help, but I want to look into that later. I want to take this opportunity to make the serial code more efficient.

Any help appreciated.

  • you should split your code into smaller pieces (functions); at the moment just reading it is a tedious task - readability counts, you know... – root Apr 09 '13 at 14:49
  • @root I am new to Python (I usually code in C, C++), so I just do not know how passing and returning arguments affects performance, so I decided to just put everything in the global space. It was written in a hurry and I will improve it as I continue using it...:) Just need some time. Although I have added comments for clarity, if anything is not clear feel free to ask. – Aman Deep Gautam Apr 09 '13 at 15:58
  • While what you say about function calls is true, separating your code into blocks usually helps to think about it algorithmically. And I am pretty sure that with a bit of rethinking you should get at least a 10x speedup. Also if you separate your code into functions you can test them separately to pick the best solution - and if in the end function calls become a problem you can factor them out later. – root Apr 09 '13 at 16:34
  • Non-scalar variables are passed by reference in Python, so calling functions will not have a devastating impact on performance – volcano Apr 19 '13 at 11:35

2 Answers

5

First of all - try to drop regular expressions; they are heavy. My original advice was crappy - it would not have worked. Maybe this will be more efficient:

import string

trans_table = string.maketrans(string.punctuation,
                               ' ' * len(string.punctuation))
words = movie_plot.translate(trans_table).split()

(An afterthought) I cannot test it, but I think that if you store the result of this call in a variable

stops = stopwords.words('english')

or probably better - convert it into a set first (if the function does not return one)

stops = set(stopwords.words('english'))

you'll get some improvement too
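Putting the two suggestions together, a rough sketch of what the tokenizing step could look like (the tokenize_plot helper is just for illustration; it assumes Python 2 and the old NLTK stem_word API that the question already uses):

import string
from nltk.corpus import stopwords
from nltk import PorterStemmer

# build all of these once, outside the per-movie loop
trans_table = string.maketrans(string.punctuation, ' ' * len(string.punctuation))
stops = set(stopwords.words('english'))   # set membership tests are O(1)
stem = PorterStemmer().stem_word          # one stemmer instance, reused

def tokenize_plot(movie_plot):
    # lowercase, replace punctuation with spaces, split on whitespace,
    # drop stop words, then stem whatever is left
    words = movie_plot.lower().translate(trans_table).split()
    return [stem(w) for w in words if w not in stops]

The point is that the translation table, the stop-word set and the stemmer are created once instead of once per word, which is where the original loop loses a lot of time.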

(To answer your question in the comments) Every function call costs time; if it hands back a large block of data that you don't keep around, the wasted time can be huge. As for set vs. list - compare the results:

In [49]: my_list = range(100)

In [50]: %timeit 10 in my_list
1000000 loops, best of 3: 193 ns per loop

In [51]: %timeit 101 in my_list
1000000 loops, best of 3: 1.49 us per loop

In [52]: my_set = set(my_list)

In [53]: %timeit 101 in my_set
10000000 loops, best of 3: 45.2 ns per loop

In [54]: %timeit 10 in my_set
10000000 loops, best of 3: 47.2 ns per loop

While we are at the gory details - here are measurements for split vs. RE:

In [30]: %timeit words = 'This is a long; and meaningless - sentence'.split(split_let)
1000000 loops, best of 3: 271 ns per loop

In [31]: %timeit words = re.findall(r'\w+', 'This is a long; and meaningless - sentence', flags = re.UNICODE | re.LOCALE)
100000 loops, best of 3: 3.08 us per loop
volcano
0

Another thing that may slow down performance - deleting from a dictionary. Re-building the dictionary may be much more efficient (takewhile works here because most_common() returns its entries sorted by count in descending order):

from itertools import takewhile

word_dict = {key: count for key, count in
             takewhile(lambda key_count: key_count[1] >= 5,
                       main_dict.most_common())}
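For example, with a made-up main_dict:

main_dict = Counter({'movie': 7, 'plot': 6, 'hero': 3, 'rare': 1})
main_dict.most_common()
# -> [('movie', 7), ('plot', 6), ('hero', 3), ('rare', 1)]
# takewhile keeps ('movie', 7) and ('plot', 6), then stops at ('hero', 3)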

On the whole, I am a bit lazy to get into all the details, but I think caching a few references may be more efficient. As far as I see, you don't need the *doc_count* variable at all - it's redundant and inefficient, and re-evaluating it hurts performance too; the document frequencies can be recovered from the per-movie counters afterwards (see the sketch after the code below). The same goes for rebuilding list(main_dict) inside the per-movie loop - *main_dict.keys()* gives you the list of all words; compute it once and reuse it.

This is a sketch of what I have in mind - I cannot prove that it's more efficient, but it certainly looks more Pythonic:

with open(inp,'r') as plot_data:
    word_dict = Counter()
    file1, file2 = itertools.tee(plot_data, 2)
    line_one = itertools.islice(file1, 0, None, 4)
    line_two = itertools.islice(file2, 2, None, 4)
    all_stop_words = set(stopwords.words('english'))
    movie_dict = defaultdict(Counter)
    stemmer_func = PorterStemmer().stem_word 
    for movie_name, movie_plot in itertools.izip(line_one, line_two):
        movie_plot = movie_plot.lower()
        words = movie_plot.translate(trans_table).split()  # trans_table built once before the loop, as in the snippet above
        all_words = [stemmer_func(word) for word in words 
                     if word not in all_stop_words]
        current_word_counter = Counter(all_words)
        movie_dict[movie_name].update(current_word_counter)
        word_dict.update(current_word_counter)
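
And to show why doc_count is not needed - the document frequencies can be recovered from movie_dict after the loop. A rough, untested continuation of the sketch above (it reuses Counter, math and the out filename from the question; float(m) is there because m / doc_freq[word] would otherwise be integer division in Python 2):

# document frequency: in how many synopses each word occurs at least once
doc_freq = Counter()
for per_movie in movie_dict.itervalues():
    doc_freq.update(per_movie.keys())

# keep only the words that occur at least 5 times over the whole corpus
word_dict = {word: count for word, count in word_dict.iteritems() if count >= 5}

kept_words = list(word_dict)   # evaluate the word list once, reuse it for every movie
m = len(movie_dict)
with open(out, 'w') as bow_vec:
    for movie_name, counts in movie_dict.iteritems():
        vector = [counts[word] * math.log(float(m) / doc_freq[word], 2)
                  for word in kept_words]
        bow_vec.write("%s," % movie_name)
        bow_vec.write(",".join("%s" % x for x in vector))
        bow_vec.write("\n")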

One last thing - dictionary is not a good variable name; it does not tell you what it contains.

volcano