
I read the question how to check dictionary words and got the idea to check my text file using dictionaries. I have read the pyenchant instructions, and I thought I could use get_tokenizer to give me back all the dictionary words in the text file.

So here is where I'm stuck: I want my program to give me all the groups of dictionary words in the form of paragraphs. As soon as it encounters any junk characters, it should consider that a paragraph break and ignore everything from there until it finds X number of consecutive dictionary words.

I want it to read text files named in the sequence filename_nnn.txt, parse each one, and write the result to parsed_filename_nnn.txt. I have not gotten around to doing any of the file manipulation yet.
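
Something along these lines is what I have in mind for the file handling (a rough sketch only; parse_text here is just a placeholder for the parsing step):

import glob

# Rough sketch: process every filename_nnn.txt in the current directory
for path in glob.glob("filename_*.txt"):
    with open(path) as f:
        text = f.read()
    result = parse_text(text)  # placeholder for the dictionary-word parsing
    with open("parsed_" + path, "w") as out:
        out.write(result)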

What I have so far:

import enchant
from enchant.tokenize import get_tokenizer, HTMLChunker
dictSentCheck = get_tokenizer("en_US")
sentCheck = raw_input("Check Sentence: ")

def check_dictionary():
    outcome = dictCheck.check(wordCheck) 
    test = [w[0] for w in dictSentCheck(sentCheck)]
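
For reference, the tokenizer yields (word, position) tuples, which is why I take w[0] in the list comprehension:

tknzr = get_tokenizer("en_US")
print [w for w in tknzr("this is a test")]
# [('this', 0), ('is', 5), ('a', 8), ('test', 10)]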

------ sample text -----

English cricket cuts ties with Zimbabwe Wednesday, 25 June, 2008 text<void(0);><void(0);> <void(0);>email <void(0);>print EMAIL THIS ARTICLE your name: your email address: recipient's name: recipient's email address: <;>add another recipient your comment: Send Mail<void(0);> close this form <http://ad.au.doubleclick.net/jump/sbs.com.au/worldnews;sz=300x250;tile=2;ord=123456789?> The England and Wales Cricket Board (ECB) announced it was suspending all ties with Zimbabwe and was cancelling Zimbabwe's tour of England next year.

The script should return:

English cricket cuts ties with Zimbabwe Wednesday

The England and Wales Cricket Board (ECB) announced it was suspending all ties with Zimbabwe and was cancelling Zimbabwe's tour of England next year

I accepted abarnert's response. Below is my final script. Note that it is VERY inefficient and should be cleaned up some. Also, as a disclaimer, I have not coded since college, a LONG time ago.

import enchant
from enchant.tokenize import get_tokenizer
import os

def clean_files():
    os.chdir("TARGET_DIRECTORY")
    for files in os.listdir("."):
        # Get the number out of the file name
        file_number = files[files.rfind("_")+1:files.rfind(".")]

        # Print status to screen
        print "Working on file: ", files

        # Read and process the original file
        original_file = open("name_"+file_number+".txt", "r")
        read_original_file = original_file.read()

        # Start the parsing of the files
        token_words = tokenize_words(read_original_file)
        parse_result = '\n'.join(split_on_angle_brackets(token_words, file_number))
        original_file.close()

        # Commit changes to the parsed file
        parsed_file = open("name_"+file_number+"_parse.txt", "wb")
        parsed_file.write(parse_result)
        parsed_file.close()

def tokenize_words(file_words):
    tokenized_sentences = get_tokenizer("en_US")
    word_tokens = tokenized_sentences(file_words)
    token_result = [w[0] for w in word_tokens]
    return token_result

def check_dictionary(dict_word):
    check_word = enchant.Dict("en_US")
    validated_word = check_word.check(dict_word)
    return validated_word
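
One easy cleanup, since the script above creates a new Dict for every single word: build the Dict once and reuse it (untested sketch):

word_dict = enchant.Dict("en_US")

def check_dictionary(dict_word):
    # Reuse the single Dict instead of constructing one per word
    return word_dict.check(dict_word)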

def split_on_angle_brackets(token_words, file_number):
    para = []
    bracket_stack = 0
    ignored_words_per_file = open("name_"+file_number+"_ignored_words.txt", "wb")
    for word in token_words:
        if bracket_stack:
            # Inside an <...> section: track nesting until it is closed
            if word == 'gt':
                bracket_stack -= 1
            elif word == 'lt':
                bracket_stack += 1
        else:
            if word == 'lt':
                # Junk starts here; emit the paragraph if it is long enough
                if len(para) >= 7:
                    yield ' '.join(para)
                para = []
                bracket_stack = 1
            elif word != 'amp':
                if check_dictionary(word):
                    para.append(word)
                    #print "append ", word
                else:
                    print "Ignored word: ", word
                    ignored_words_per_file.write(word + " \n")
    if para:
        yield ' '.join(para)

    # Close opened files
    ignored_words_per_file.close()

clean_files()
    Is there a reason you're using an `'en_US'` tokenizer to parse British English? – abarnert Feb 13 '13 at 18:44
  • Are you really fetching the text with HTML entities in it instead of the actual HTML? – Wooble Feb 13 '13 at 18:48
  • What is `dictCheck` in your code? What do you consider "junk characters"? – abarnert Feb 13 '13 at 18:49
  • Well, I will be parsing English, not sure if it will be British or US. Can I use both dictionaries then? I wanted to take out all the HTML links from the text. I think I was using HTMLChunker wrong. Junk characters: text<void(0);><void(0);> <void(0);>email <void(0);> – danipolo Feb 13 '13 at 18:50
  • Well, you're not using `HTMLChunker` at all in your code, so it's hard to say if you're using it wrong in different code that you haven't shown us… – abarnert Feb 13 '13 at 18:52
  • Meanwhile, just taking out all HTML links isn't going to do anything for, say, `print EMAIL THIS ARTICLE` and other strings of perfectly good English like that. – abarnert Feb 13 '13 at 18:54
  • I imported `HTMLChunker` thinking I would use it later. But after I played with `get_tokenizer`, I think I may not need it. – danipolo Feb 13 '13 at 18:56
  • I'm OK with things like `print EMAIL THIS ARTICLE` being left behind, although I may set the required number of consecutive words to 7; that way, only runs of 7 or more consecutive words would be written to parsed_filename.txt – danipolo Feb 13 '13 at 18:58
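
For reference, if HTMLChunker turned out to be needed after all (as discussed in the comments above), the pyenchant docs show it being passed to get_tokenizer roughly like this:

from enchant.tokenize import get_tokenizer, HTMLChunker

# Strip HTML markup before tokenizing the remaining text
tknzr = get_tokenizer("en_US", chunkers=(HTMLChunker,))
words = [w[0] for w in tknzr("<p>this is <b>some</b> text</p>")]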

1 Answer


I'm still not sure what exactly your problem is, or what your code is supposed to do.

But this line seems to be the key:

test = [w[0] for w in dictSentCheck(sentCheck)]

That gives you a list of all words. It includes things like `lt` and `gt` as words. And you want to strip out anything inside an `lt`/`gt` pair.

And, as you say in your comments, "I may set the required number of consecutive words to 7".

So, something like this:

def split_on_angle_brackets(words):
    para = []
    bracket_stack = 0
    for word in words:
        if bracket_stack:
            if word == 'gt':
                bracket_stack -= 1
            elif word == 'lt':
                bracket_stack += 1
        else:
            if word == 'lt':
                if len(para) >= 7:
                    yield ' '.join(para)
                para = []
                bracket_stack = 1
            else:
                para.append(word)
    if para:
        yield ' '.join(para)

If you use it with your sample data:

print('\n'.join(split_on_angle_brackets(test)))

You get this:

English cricket cuts ties with Zimbabwe Wednesday June text
print EMAIL THIS ARTICLE your name your email address recipient's name recipient's email address
add another recipient your comment Send Mail
The England and Wales Cricket Board ECB announced it was suspending all ties with Zimbabwe and was cancelling Zimbabwe's tour of England next year

That doesn't match your sample output, but I can't think of any rule that would provide your sample output, so instead I'm trying to implement the rule you described.

abarnert
  • Thank you for the help! I can't get `yield` to work/return anything. I replaced `yield` with `result.append()` to test out the function, and at least I can see it's working. Any idea why `yield` wouldn't work for me? – danipolo Feb 14 '13 at 21:04
  • `yield` doesn't "return" anything; it turns your function into a generator function. When you call it, you don't get a value, you get something which can be iterated over as if it were a sequence of values, one for each time you `yield`. In my code, I passed `split_on_angle_brackets(test)` to `join`, which does the iteration. But you can see what's going on by just doing `for thing in split_on_angle_brackets(test): print(thing)`. – abarnert Feb 14 '13 at 21:19
  • If you can't get the hang of generator functions, you can just stick a `result=[]` at the top, replace both `yield` lines with `result.append` lines, and then add a `return result` at the end, which turns it into a function that builds the whole list at once and returns it. (A list can obviously be iterated over as a sequence of values too.) But it's worth doing some simpler generator examples to get the hang of it—this is one of those abstractions that, once you get it, makes programming much easier. (If I'm remembering the right one, http://dabeaz.com/generators/ is amazingly helpful.) – abarnert Feb 14 '13 at 21:23
  • OK, thank you. I have read up on generators, somehow I'm still failing :) At least I get an object with a hex value, that's progress! – danipolo Feb 14 '13 at 21:33
  • @danipolo: If the object you're getting looks like `<generator object split_on_angle_brackets at 0x...>`, try doing a `for i in g: print(i)`, or just `print(list(g))` to see the sequence that it's generating. If you just need to get this project done, feel free to switch to the list, and come back to learning generators later. The list code is slightly less efficient, and a few lines longer, but if you understand it and can get it to work, that's what matters. – abarnert Feb 14 '13 at 21:44
  • The caller is missing a ")". It should be `print('\n'.join(split_on_angle_brackets(test)))`. Took me way too long to figure out. – danipolo Feb 14 '13 at 21:44
  • @danipolo: Good catch. (Off-by-one errors are just as bad in copy-paste as they are in loop code…) I'll fix it in the answer. Is it working for you now? – abarnert Feb 14 '13 at 21:53
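
For reference, the list-building variant abarnert describes in the comments above would look roughly like this (same logic, with each `yield` replaced by `result.append`):

def split_on_angle_brackets(words):
    result = []
    para = []
    bracket_stack = 0
    for word in words:
        if bracket_stack:
            if word == 'gt':
                bracket_stack -= 1
            elif word == 'lt':
                bracket_stack += 1
        else:
            if word == 'lt':
                if len(para) >= 7:
                    result.append(' '.join(para))
                para = []
                bracket_stack = 1
            else:
                para.append(word)
    if para:
        result.append(' '.join(para))
    return result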