
Need some advice on improving the performance of my code.

I have two files (keyword.txt, description.txt). keyword.txt consists of a list of keywords (11,000+ to be specific) and description.txt consists of very large text descriptions (9,000+).

I am trying to read keywords from keyword.txt one at a time and check if the keyword exists in the description. If the keyword exists, I write it to a new file. So this is like a many-to-many relationship (11,000 * 9,000).

Sample Keywords:

Xerox
VMWARE CLOUD

Sample Description (it's huge):

Planning and implementing entire IT Infrastructure. Cyberoam firewall implementation and administration in head office and branch office. Report generation and analysis. Including band width conception, internet traffic and application performance. Windows 2003/2008 Server Domain controller implementation and managing. VERITAS Backup for Clients backup, Daily backup of applications and database. Verify the backed up database for data integrity. Send backup tapes to remote location for safe storage Installing and configuring various network devices; Routers, Modems, Access Points, Wireless ADSL+ modems / Routers Monitoring, managing & optimizing Network. Maintaining Network Infrastructure for various clients. Creating Users and maintaining the Linux Proxy servers for clients. Trouble shooting, diagnosing, isolating & resolving Windows / Network Problems. Configuring CCTV camera, Biometrics attendance machine, Access Control System Kaspersky Internet Security / ESET NOD32

Below is the code which I've written:

import csv
import nltk
import re
wr = open(OUTPUTFILENAME,'w')
def match():
    c = 0
    ft = open('DESCRIPTION.TXT','r')
    ky2 = open('KEYWORD.TXT','r')
    reader = csv.reader(ft)
    keywords = []
    keyword_reader2 = csv.reader(ky2)
    for x in keyword_reader2: # Storing all the keywords to a list
        keywords.append(x[1].lower())

    string = ' '
    c = 0
    for row in reader:
        sentence = row[1].lower()
        id = row[0]
        for word in keywords:
            if re.search(r'\b{}\b'.format(re.escape(word.lower())),sentence):
                string = string + id + '$' + word.lower() + '$' + sentence + '\n'
                c = c + 1
        if c > 5000:  # I am writing 5000 lines at a time.
            print("Batch printed")
            c = 0
            wr.write(string)
            string = ' '
    wr.write(string)
    ky2.close()
    ft.close()
    wr.close()

match()

Now this code takes around 120 min to complete. I tried a couple of ways to improve the speed.

  1. At first I was writing one line at a time; then I changed it to 5000 lines at a time, since it is a small file and I can afford to put everything in memory. I did not see much improvement.
  2. I pushed everything to stdout and used a pipe from the console to append everything to the file. This was even slower.

I want to know if there is a better way of doing this, since I may have done something wrong in the code.

My PC specs: RAM: 15 GB, Processor: i7 4th gen

Siddarth
  • One thing you can do is stop using regex (which significantly slows down the process). You can simply check `if word in sentence:` – Nir Alfasi Feb 19 '15 at 02:27
  • [this](http://stackoverflow.com/questions/4989198/python-find-regexp-in-a-file) or [this](http://stackoverflow.com/questions/10477294/how-do-i-search-for-a-pattern-within-a-text-file-using-python-combining-regex) shows a couple of basic ways of regexing large files with pointer-to-line style jumps. This should probably help you out. Additionally, it would probably be much better to write a regex that matches either lower- or uppercase (compile with the "i"/IGNORECASE flag, I think) and returns true than lowercasing your strings (immutables have to be copied every time, and it saves you from having "sentence"). – ljetibo Feb 19 '15 at 02:28
  • @alfasin That will look for a sequence of characters in the sentence, not specific words. For example, if I have the keyword **pipe** and the description has something like "~ i help establish a data **pipeline** ~", even then it will return true. That was the reason I moved to regex. – Siddarth Feb 19 '15 at 02:30
  • @Siddarth so filter the results with regex. It'll definitely be much quicker ;) – Nir Alfasi Feb 19 '15 at 02:36
  • @Siddarth From what I've seen, just jump right into mmap. That seems to be exactly the thing you're looking for. – ljetibo Feb 19 '15 at 02:49
  • @ljetibo `data = mmap.mmap(f.fileno(), size, access=mmap.ACCESS_READ)` Can I iterate line by line over data? – Siddarth Feb 19 '15 at 02:52
  • @Siddarth Yes. [Read this](https://docs.python.org/2/library/mmap.html). "Memory-mapped file objects behave like both strings and like file objects." That means you act on them just like on files, including `for line in`, I would think ;). `Seek`, slicing notation etc. are also supported... – ljetibo Feb 19 '15 at 02:54
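
Summarising the first comments above: a plain `in` substring test is much cheaper than a regex, so it can serve as a pre-filter, with the word-boundary regex run only on the few candidates that pass. A rough sketch of that idea (the function and variable names are illustrative, not from the thread):

import re

def find_keywords(sentence, keywords):
    # cheap substring pre-filter first, regex only to confirm word boundaries
    sentence = sentence.lower()
    hits = []
    for word in keywords:          # keywords assumed already lowercased
        if word in sentence:       # fast check, may give false positives ("pipe" in "pipeline")
            if re.search(r'\b{}\b'.format(re.escape(word)), sentence):
                hits.append(word)
    return hits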

2 Answers


I'm guessing you want to make your searches faster. In that case, if you don't care about the frequency of the keywords in the description, only whether they exist, you could try the following:

For each description, split the text into individual words and generate a set of unique words.

Then, for each keyword in your list of keywords, check whether the set contains that keyword, and write it to the file if it does.

That should make your iterations faster. It should also help you skip the regexes, which are also likely part of your performance problem.

PS: My approach assumes that you filter out punctuation.
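
A rough sketch of that idea for single-word keywords, assuming punctuation has already been stripped (the names here are illustrative):

def match_keywords(description, keywords):
    # build the set of unique words once per description,
    # then each keyword lookup is a constant-time set membership test
    words = set(description.lower().split())
    return [kw for kw in keywords if kw in words]

This does one pass over each description instead of 11,000+ regex scans of it.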

  • thanks @gorton fishman I initially thought about this approach. But some of my keywords have two or three words in them, for example **vmware cloud**. If I tokenize the description then it would tokenize this into ['vmware','cloud']. I want to see if 'vmware cloud' exists as it is in the description. – Siddarth Feb 19 '15 at 02:42
  • In this case, you might want to try generating n-grams from the text. This will of course require some more preprocessing, but the search itself should be unaffected: http://en.wikipedia.org/wiki/N-gram – Gorton Fishman Feb 19 '15 at 02:44
  • Hmm yeah! does nltk provide a n-gram tokenizer? – Siddarth Feb 19 '15 at 02:46
  • I don't think there's a tokenizer that generates the grams, you'd have to iterate over it yourself, but you could also try the collocations API. Of course, there's no 100% guarantee that the candidates it selects would correlate to your keywords, however it's pretty likely it would if you don't have ultra complicated keywords. http://www.nltk.org/howto/collocations.html – Gorton Fishman Feb 19 '15 at 03:18
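
Following the n-gram suggestion in the comments above, multi-word keywords such as "vmware cloud" can be handled by adding every word n-gram of the description (up to the length of the longest keyword) to the lookup set. A minimal sketch, not from the answer itself:

def ngram_set(text, max_n):
    # collect all space-joined word n-grams of length 1..max_n
    words = text.lower().split()
    grams = set()
    for n in range(1, max_n + 1):
        for i in range(len(words) - n + 1):
            grams.add(' '.join(words[i:i + n]))
    return grams

# max_n = number of words in the longest keyword, e.g. 2 for "vmware cloud"
# grams = ngram_set(description, max_n)
# found = [kw for kw in keywords if kw in grams]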

If all your search-word phrases consist of whole words (begin/end on a word boundary), then parallel indexing into a word tree would be about as efficient as it gets.

Something like

# keep lowercase characters and digits
# keep apostrophes for contractions (isn't, couldn't, etc)
# convert uppercase characters to lowercase
# replace all other printable symbols with spaces
TO_ALPHANUM_LOWER = str.maketrans(
    "ABCDEFGHIJKLMNOPQRSTUVWXYZ'!#$%&()*+,-./:;<=>?@[]^_`{|}~ \t\n\r\x0b\x0c\"\\",
    "abcdefghijklmnopqrstuvwxyz'                                     "
)

def clean(s):
    """
    Convert string `s` to canonical form for searching
    """
    return s.translate(TO_ALPHANUM_LOWER)

class WordTree:
    __slots__ = ["children", "terminal"]

    def __init__(self, phrases=None):
        self.children = {}   # {"word": WordTree}
        self.terminal = ''   # if end of search phrase, full phrase is stored here
        # preload tree
        if phrases:
            for phrase in phrases:
                self.add_phrase(phrase)

    def add_phrase(self, phrase):
        tree  = self
        words = clean(phrase).split()
        for word in words:
            ch = tree.children
            if word in ch:
                tree = ch[word]
            else:
                tree = ch[word] = WordTree()
        tree.terminal = " ".join(words)

    def inc_search(self, word):
        """
        Search one level deeper into the tree

        Returns
          (None,    ''    )  if word not found
          (subtree, ''    )  if word found but not terminal
          (subtree, phrase)  if word found and completes a search phrase
        """
        ch = self.children
        if word in ch:
            wt = ch[word]
            return wt, wt.terminal
        else:
            return (None, '')

    def parallel_search(self, text):
        """
        Return all search phrases found in text
        """
        found  = []
        fd = found.append
        partials = []
        for word in clean(text).split():
            new_partials = []
            np = new_partials.append
            # new search from root
            wt, phrase = self.inc_search(word)
            if wt:     np(wt)
            if phrase: fd(phrase)
            # continue existing partial matches
            for partial in partials:
                wt, phrase = partial.inc_search(word)
                if wt:     np(wt)
                if phrase: fd(phrase)
            partials = new_partials
        return found

    def tree_repr(self, depth=0, indent="  ", terminal=" *"):
        for word,tree in self.children.items():
            yield indent * depth + word + (terminal if tree.terminal else '')
            yield from tree.tree_repr(depth + 1, indent, terminal)

    def __repr__(self):
        return "\n".join(self.tree_repr())

then your program becomes

import csv

SEARCH_PHRASES = "keywords.csv"
SEARCH_INTO    = "descriptions.csv"
RESULTS        = "results.txt"

# get search phrases, build WordTree
with open(SEARCH_PHRASES) as inf:
    wt = WordTree(phrase for _, phrase in csv.reader(inf))

with open(SEARCH_INTO) as inf, open(RESULTS, "w") as outf:
    # bound methods (save some look-ups)
    find_phrases = wt.parallel_search
    fmt          = "{}${}${}\n".format
    write        = outf.write
    # sentences to search
    for id,sentence in csv.reader(inf):
        # search phrases found
        for found in find_phrases(sentence):
            # store each result
            write(fmt(id, found, sentence))

which should be something like a thousand times faster.
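
As a quick sanity check, here is a minimal usage sketch with the sample keywords from the question and a made-up sentence containing both (the expected output in the comment is mine, not captured from a run):

wt = WordTree(["Xerox", "VMWARE CLOUD"])
print(wt.parallel_search("Planning and implementing entire IT Infrastructure on VMware Cloud, plus Xerox printer support."))
# expected: ['vmware cloud', 'xerox']

Note that the matches come back in the canonical lower-case form produced by clean().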

Hugh Bothwell