Python: Parallely Fast Dictionary search of a given word list + List Enumeration + Do Something()

Question

So I thought this Title would produce good search results. Anyway, given the following code: It takes a one yield word as word from text_file_reader_gen() and iterates over under while loop until error where exception is given (Is there a better way than that other than try except?) and the interlock function just mixes them up.

def wordparser():
#word_freq={}
word=text_file_reader_gen()
word.next()
wordlist=[]
index=0
while True: #for word in ftext:
    try:
        #print 'entered try'
        current=next(word)
        wordlist.append(current) #Keep adding new words
        #word_freq[current]=1
        if len(wordlist)>2:
            while index < len(wordlist)-1:
                #print 'Before: len(wordlist)-1: %s || index: %s' %(len(wordlist)-1, index)
                new_word=interlock_2(wordlist[index],wordlist[index+1]) #this can be any do_something() function, irrelevant and working fine
                new_word2=interlock_2(wordlist[index+1],wordlist[index])
                print new_word,new_word2
                '''if new_word in word_freq:
                    correct_interlocked_words.append(new_word)
                if new_word2 in word_freq:
                    correct_interlocked_words.append(new_word2)'''
                index+=1
                #print 'After: len(wordlist)-1: %s || index: %s' %(len(wordlist)-1, index)
            '''if w not in word_freq:
                word_freq[w]=1
            else:
                word_freq[w]=+1'''
    except StopIteration,e:
        #print 'entered except'
        #print word_freq
        break
#return word_freq

text_file_reader_gen() code:

def text_file_reader_gen():
    path=str(raw_input('enter full file path \t:'))
    fin=open(path,'r')
    ftext=(x.strip() for x in fin)
    for word in ftext:
        yield word

Q1. Is it possible for word to be iterated and at the same time appending those word to the dictionary word_freq while at the same time enumerating over for key in word_freq where keys are words & are still being added, while the for loop runs and new words are mixed using the interlock function so that most of these iterations happen at one go- something like

while word.next() is not StopIteration: 
                word_freq[ftext.next()]+=1 if ftext not in word_freq #and
                for i,j in word_freq.keys():
                      new_word=interlock_2(j,wordlist[i+1])

I just wanted a very simple thing and a hash dict search, like really very fast because the txt file from where it is taking words is a-z very long, it may have duplicates as well.

Q2. Ways to improvise this existing code? Q3. Is there a way to 'for i,j in enumerate(dict.items())' so that i can reach dict[key] & dict[next_key] at the same time, although they are unordered, but that's also irrelevant.

UPDATE: After reviewing answers here, this is what I came up. It's working but I have a question regarding the following code:

def text_file_reader_gen():
    path=str(raw_input('enter full file path \t:'))
    fin=open(path,'r')
    ftext=(x.strip() for x in fin)
    return ftext #yield?


def wordparser():
    wordlist=[]
    index=0
    for word in text_file_reader_gen():

works but instead if I use yield ftext, it doesn't.

Q4. What is the basic difference and why does that happen?

A generator is an iterable so you can replace the `while` `try` `except` with simply: `for word in text_file_reader_gen() : # do stuff with word`. `word` just points to the value the generator gives you so you are free to append it/play around with it until it is reassigned in the next loop iteration. — ejrb, Apr 26 '13 at 11:26

score 1 · Answer 1 · answered Apr 26 '13 at 12:53

As far as I understand your example code, you're simply counting words. Take the following examples as ideas on which you can build on.

Q1. Yes and no. Running things in parallel is not trivial. You could use threading (GIL won't allow you true parallelism) or multiprocessing, but I don't see why you'd need to do this.

Q2. I don't understand the need for the text_file_reader_gen() function. Generators are iterators, you achieve the same thing by reading for line in file.

def word_parser():

    path = raw_input("enter full file path\t: ")
    words = {}
    with open(path, "r") as f:
        for line in f:
            for word in line.split():
                try:
                    words[word] += 1
                except KeyError:
                    words[word] = 1

    return words

The above goes through the file line by line, splits each line at whitespace and counts the word. It does not handle punctuation.

If your input files are natural language, you might want to take a look at the NTLK library. Here's another example that uses the collections library.

import collections
import string

def count_words(your_input):
    result = {}
    translate_tab = string.maketrans("","")
    with open(your_input, "r") as f:
        for line in f:
            result.update(collections.Counter(x.translate(translate_tab, string.punctuation) for x in line.split()))

    return result

 # Test.txt contains 5 paragraphs of Lorem Ipsum from some online generator
 In [61]: count_words("test.txt")
 Out[61]: 
 {'Aenean': 1,
  'Aliquam': 1,
  'Class': 1,
  'Cras': 1,
  'Cum': 1,
  'Curabitur': 2,
  'Donec': 1,
  'Duis': 1,
  'Etiam': 2,
  'Fusce': 1,
  'In': 1,
  'Integer': 1,
  'Lorem': 1,
  ......
  }

The function goes through the file line by line, creates a collections.Counter object – basically a sub-class of dict – splits each line by anything resembling whitespace, removes punctuation with string.translate and finally updates result dictionary with the Counter-dict. The Counter does all the ...counting.

Q3. Don't know why or how you'd achieve that.

score 0 · Accepted Answer · edited May 23 '17 at 11:43

Q3. Is there a way to 'for i,j in enumerate(dict.items())' so that i can reach dict[key] & dict[next_key] at the same time

You can get the next item in the iterable. So you can write a function to pair the current item with the next

Like this:

def with_next(thing):
    prev = next(thing)
    while True:
        try:
            cur = next(thing)
        except StopIteration, e:
            # There's no sane next item at the end of the iterable, so
            # use None.
            yield (prev, None)
            raise e
        yield (prev, cur)
        prev = cur

As the comment says, it's not obvious what to do at the end of the list (where there is no "next key"), so it just returns None

For example:

for curitem, nextitem in with_next(iter(['mouse', 'cat', 'dog', 'yay'])):
    print "%s (next: %s)" % (curitem, nextitem)

Outputs this:

mouse (next: cat)
cat (next: dog)
dog (next: yay)
yay (next: None)

It'll work for any iterable (e.g dict.iteritems(), dict.iterkeys(), enumerate etc):

mydict = {'mouse': 'squeek', 'cat': 'meow', 'dog': 'woof'}
for cur_key, next_key in with_next(mydict.iterkeys()):
    print "%s (next: %s)" % (cur_key, next_key)

Regarding your update:

def text_file_reader_gen():
    path=str(raw_input('enter full file path \t:'))
    fin=open(path,'r')
    ftext=(x.strip() for x in fin)
    return ftext #yield?

Q4. What is the basic difference [between yield and return] and why does that happen?

yield and return are very different things.

return returns a value from the function, and then the function terminates.

yield turns the function into a "generator function". Instead of returning a single object and ending, a generator function outputs a series of objects, one each time yield is called.

Here are a bunch of good pages explaining generators:

The return statement works like it does in many other programming langages. Things like the official tutorial should explain it

if I do like ' yield ftext, next(ftext)' 'for word, nextword in text_file_reader_gen(): print word, nextword' i get 'enter full file path :C:\Users\Hero\Desktop\uastory.txt at 0x0000000007826900> oneself' still looking if i can implement a better code within the text_file_reader_gen function — user2290820, May 01 '13 at 14:36

Python: Parallely Fast Dictionary search of a given word list + List Enumeration + Do Something()

2 Answers2