
I am new to Python and naively wrote a Python script for the following task:

I want to create a bag-of-words representation of multiple objects. Each object is basically a pair <movie name, movie synopsis>, and the bag-of-words representation is to be made from the synopsis. So each object is converted to <movie name, bag-of-words vector of the synopsis> in the final output.

Here is the script:

import re
import math
import itertools
from nltk.corpus import stopwords
from nltk import PorterStemmer
from collections import defaultdict
from collections import Counter
from itertools import dropwhile

import sys, getopt

inp = "inp_6000.txt"  #input file name
out = "bowfilter10"   #output file name
with open(inp,'r') as plot_data:
    main_dict = Counter()
    file1, file2 = itertools.tee(plot_data, 2)
    line_one = itertools.islice(file1, 0, None, 4)
    line_two = itertools.islice(file2, 2, None, 4)
    dictionary = defaultdict(Counter)
    doc_count = defaultdict(Counter)
    for movie_name, movie_plot in itertools.izip(line_one, line_two):
        movie_plot = movie_plot.lower()
        words = re.findall(r'\w+', movie_plot, flags = re.UNICODE | re.LOCALE)  #split words
        elemStopW = filter(lambda x: x not in stopwords.words('english'), words)   #remove stop words, python nltk
        for word in elemStopW:
            word = PorterStemmer().stem_word(word)   #use python stemmer class to do stemming
            #increment the word count of the movie in the particular movie synopsis
            dictionary[movie_name][word] += 1
            #increment the count of a particular word in the main dictionary, which stores frequencies over all documents
            main_dict[word] += 1
            #This is done to calculate term frequency-inverse document frequency. It takes note of the first occurrence of the word in the synopsis and neglects all others.
            if doc_count[word]['this_mov']==0:
                doc_count[word].update(count=1, this_mov=1);
        for word in doc_count:
            doc_count[word].update(this_mov=-1)
    #print "---------main_dict---------"
    #print main_dict
    #Remove all the words with frequency less than 5 in whole set of movies
    for key, count in dropwhile(lambda key_count: key_count[1] >= 5, main_dict.most_common()):
        del main_dict[key]
    #print main_dict
    #Write to file
    bow_vec = open(out, 'w');
    #calculate the bow vector and write it
    m = len(dictionary)
    for movie_name in dictionary.keys():
        #print movie_name
        vector = []
        for word in list(main_dict):
            #print word, dictionary[movie_name][word]
            x = dictionary[movie_name][word] * math.log(m/doc_count[word]['count'], 2)
            vector.append(x)
        #write to file
        bow_vec.write("%s" % movie_name)
        for item in vector:
            bow_vec.write("%s," % item)
        bow_vec.write("\n")

Format of the data file and additional information about data: The data file has the following format:

<Movie Name>
<Empty line>
<Movie synopsis> (one can assume the size to be around 150 words)
<Empty line>

Note: the <...> are meant for representation only.
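For example (titles and synopses are made up), each record occupies four lines, which is why the script reads every fourth line:

The Matrix

A computer hacker learns that the world he lives in is a simulation and joins a rebellion against its controllers.

Inception

A thief who extracts secrets from people's dreams is hired to plant an idea in a target's mind instead.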

Size of the input file:
The file size is around 200 MB.

As of now this script takes around 10-12 hours on a 3 GHz Intel processor.

Note: I am looking for improvements to the serial code. I know parallelization would help, but I want to look into that later. I want to take this opportunity to make the serial code more efficient.

Any help appreciated.

  • you should split your code into smaller pieces (functions); at the moment just reading it is a tedious task - readability counts, you know... – root Apr 09 '13 at 14:49
  • @root I am new to Python (I usually code in C, C++), so I just do not know how passing and returning arguments affects performance, so I decided to just put everything in the global space. It was written in a hurry and I will improve it as I continue using it...:) Just need some time. Although I have added comments for clarity, if anything is not clear feel free to ask. – Aman Deep Gautam Apr 09 '13 at 15:58
  • While what you say about function calls is true, separating your code into blocks usually helps to think about it algorithmically. And I am pretty sure that with a bit of rethinking you should get at least a 10x speedup. Also if you separate your code into functions you can test them separately to pick the best solution - and if in the end function calls become a problem you can factor them out later. – root Apr 09 '13 at 16:34
  • Non-scalar variables are passed by reference in Python, so calling functions will not have a devastating impact on performance – volcano Apr 19 '13 at 11:35

2 Answers

5

First of all - try to drop regular expressions; they are heavy. My original advice was crappy - it would not have worked. Maybe this will be more efficient:

import string

trans_table = string.maketrans(string.punctuation,
                               ' ' * len(string.punctuation))
words = movie_plot.translate(trans_table).split()

(An afterthought) I cannot test it, but I think that if you store the result of this call in a variable

stops = stopwords.words('english')

or probably better - convert it into a set first (if the function does not return one)

stops = set(stopwords.words('english'))

you'll get some improvement too
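Putting the two suggestions together, a rough sketch of what the tokenizing step could look like (the tokenize_plot helper is just for illustration; it assumes Python 2 and the old NLTK stem_word API that the question already uses):

import string
from nltk.corpus import stopwords
from nltk import PorterStemmer

# build all of these once, outside the per-movie loop
trans_table = string.maketrans(string.punctuation, ' ' * len(string.punctuation))
stops = set(stopwords.words('english'))   # set membership tests are O(1)
stem = PorterStemmer().stem_word          # one stemmer instance, reused

def tokenize_plot(movie_plot):
    # lowercase, replace punctuation with spaces, split on whitespace,
    # drop stop words, then stem whatever is left
    words = movie_plot.lower().translate(trans_table).split()
    return [stem(w) for w in words if w not in stops]

The point is that the translation table, the stop-word set and the stemmer are created once instead of once per word, which is where the original loop loses a lot of time.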

(To answer your question in the comments) Every function call costs time; if it hands back a large block of data that you don't keep around, the wasted time can be huge. As for set vs. list - compare the results:

In [49]: my_list = range(100)

In [50]: %timeit 10 in my_list
1000000 loops, best of 3: 193 ns per loop

In [51]: %timeit 101 in my_list
1000000 loops, best of 3: 1.49 us per loop

In [52]: my_set = set(my_list)

In [53]: %timeit 101 in my_set
10000000 loops, best of 3: 45.2 ns per loop

In [54]: %timeit 10 in my_set
10000000 loops, best of 3: 47.2 ns per loop

While we are at the gory details - here are measurements for split vs. RE:

In [30]: %timeit words = 'This is a long; and meaningless - sentence'.split(split_let)
1000000 loops, best of 3: 271 ns per loop

In [31]: %timeit words = re.findall(r'\w+', 'This is a long; and meaningless - sentence', flags = re.UNICODE | re.LOCALE)
100000 loops, best of 3: 3.08 us per loop
volcano
0

Another thing that may slow down performance - deleting from a dictionary. Re-building the dictionary may be much more efficient (takewhile works here because most_common() returns its entries sorted by count in descending order):

from itertools import takewhile

word_dict = {key: count for key, count in
             takewhile(lambda key_count: key_count[1] >= 5,
                       main_dict.most_common())}
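For example, with a made-up main_dict:

main_dict = Counter({'movie': 7, 'plot': 6, 'hero': 3, 'rare': 1})
main_dict.most_common()
# -> [('movie', 7), ('plot', 6), ('hero', 3), ('rare', 1)]
# takewhile keeps ('movie', 7) and ('plot', 6), then stops at ('hero', 3)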

On the whole, I am a bit lazy to get into all the details, but I think caching a few references may be more efficient. As far as I see, you don't need the *doc_count* variable at all - it's redundant and inefficient, and re-evaluating it hurts performance too; the document frequencies can be recovered from the per-movie counters afterwards (see the sketch after the code below). The same goes for rebuilding list(main_dict) inside the per-movie loop - *main_dict.keys()* gives you the list of all words; compute it once and reuse it.

This is a sketch of what I have in mind - I cannot prove that it's more efficient, but it certainly looks more Pythonic:

with open(inp,'r') as plot_data:
    word_dict = Counter()
    file1, file2 = itertools.tee(plot_data, 2)
    line_one = itertools.islice(file1, 0, None, 4)
    line_two = itertools.islice(file2, 2, None, 4)
    all_stop_words = set(stopwords.words('english'))
    movie_dict = defaultdict(Counter)
    stemmer_func = PorterStemmer().stem_word 
    for movie_name, movie_plot in itertools.izip(line_one, line_two):
        movie_plot = movie_plot.lower()
        words = movie_plot.translate(trans_table).split()  # trans_table built once before the loop, as in the snippet above
        all_words = [stemmer_func(word) for word in words 
                     if word not in all_stop_words]
        current_word_counter = Counter(all_words)
        movie_dict[movie_name].update(current_word_counter)
        word_dict.update(current_word_counter)
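
And to show why doc_count is not needed - the document frequencies can be recovered from movie_dict after the loop. A rough, untested continuation of the sketch above (it reuses Counter, math and the out filename from the question; float(m) is there because m / doc_freq[word] would otherwise be integer division in Python 2):

# document frequency: in how many synopses each word occurs at least once
doc_freq = Counter()
for per_movie in movie_dict.itervalues():
    doc_freq.update(per_movie.keys())

# keep only the words that occur at least 5 times over the whole corpus
word_dict = {word: count for word, count in word_dict.iteritems() if count >= 5}

kept_words = list(word_dict)   # evaluate the word list once, reuse it for every movie
m = len(movie_dict)
with open(out, 'w') as bow_vec:
    for movie_name, counts in movie_dict.iteritems():
        vector = [counts[word] * math.log(float(m) / doc_freq[word], 2)
                  for word in kept_words]
        bow_vec.write("%s," % movie_name)
        bow_vec.write(",".join("%s" % x for x in vector))
        bow_vec.write("\n")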

One last thing - dictionary is not a good variable name; it does not tell you what it contains.

volcano