I read the question about how to check whether words are in a dictionary, and it gave me the idea of checking my text files using dictionaries. I have read the pyenchant instructions, and I thought I could use get_tokenizer to give me back all of the dictionary words in the text file.
Here is where I'm stuck: I want my program to give me all the groups of dictionary words, each in the form of a paragraph. As soon as it encounters any junk characters, it should treat that as a paragraph break and ignore everything from there until it finds X consecutive dictionary words.
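Roughly, the grouping I have in mind looks like this sketch (untested; MIN_RUN is my placeholder for X, and dictionary would be an enchant.Dict object):

MIN_RUN = 5  # placeholder for X: a paragraph needs this many consecutive words

def group_paragraphs(words, dictionary):
    # yield runs of dictionary words; any junk word ends the current run
    para = []
    for word in words:
        if dictionary.check(word):
            para.append(word)
        else:
            # junk word: keep the paragraph only if the run was long enough
            if len(para) >= MIN_RUN:
                yield ' '.join(para)
            para = []
    if len(para) >= MIN_RUN:
        yield ' '.join(para)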
I want it to read text files named in the sequence filename_nnn.txt, parse each one, and write the result to parsed_filename_nnn.txt. I have not gotten around to doing any of the file manipulation yet.
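For the file loop, I picture something like this (untested; the glob pattern and the parse() step are my placeholders):

import glob

for path in glob.glob("filename_*.txt"):
    with open(path) as f:
        text = f.read()
    parsed = parse(text)  # placeholder for the paragraph-grouping step above
    with open("parsed_" + path, "w") as out:
        out.write(parsed)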
What I have so far:
import enchant
from enchant.tokenize import get_tokenizer

dictCheck = enchant.Dict("en_US")
dictSentCheck = get_tokenizer("en_US")
sentCheck = raw_input("Check sentence: ")

def check_dictionary(wordCheck):
    # True if the word is in the en_US dictionary
    return dictCheck.check(wordCheck)

# keep only the word from each (word, position) tuple the tokenizer yields
test = [w[0] for w in dictSentCheck(sentCheck)]
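For reference, my understanding from the pyenchant docs is that the tokenizer yields (word, position) tuples, which is why w[0] pulls out just the word:

>>> from enchant.tokenize import get_tokenizer
>>> tknzr = get_tokenizer("en_US")
>>> [w for w in tknzr("this is some simple text")]
[('this', 0), ('is', 5), ('some', 8), ('simple', 13), ('text', 20)]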
----- sample text -----
English cricket cuts ties with Zimbabwe Wednesday, 25 June, 2008 text<void(0);><void(0);> <void(0);>email <void(0);>print EMAIL THIS ARTICLE your name: your email address: recipient's name: recipient's email address: <;>add another recipient your comment: Send Mail<void(0);> close this form <http://ad.au.doubleclick.net/jump/sbs.com.au/worldnews;sz=300x250;tile=2;ord=123456789?> The England and Wales Cricket Board (ECB) announced it was suspending all ties with Zimbabwe and was cancelling Zimbabwe's tour of England next year.
The script should return:
English cricket cuts ties with Zimbabwe Wednesday
The England and Wales Cricket Board (ECB) announced it was suspending all ties with Zimbabwe and was cancelling Zimbabwe's tour of England next year
I accepted abarnert's response. Below is my final script. Note that it is VERY inefficient and should be cleaned up some. Also, as a disclaimer, I have not coded since college, a LONG time ago.
import enchant
from enchant.tokenize import get_tokenizer
import os

def clean_files():
    os.chdir("TARGET_DIRECTORY")
    for files in os.listdir("."):
        # get the number out of the file name
        file_number = files[files.rfind("_")+1:files.rfind(".")]
        # print status to screen
        print "Working on file: ", files
        # read and process the original file
        original_file = open("name_"+file_number+".txt", "r+")
        read_original_file = original_file.read()
        # start the parsing of the file
        token_words = tokenize_words(read_original_file)
        parse_result = '\n'.join(split_on_angle_brackets(token_words, file_number))
        original_file.close()
        # commit changes to the parsed file
        parsed_file = open("name_"+file_number+"_parse.txt", "wb")
        parsed_file.write(parse_result)
        parsed_file.close()

def tokenize_words(file_words):
    tokenized_sentences = get_tokenizer("en_US")
    word_tokens = tokenized_sentences(file_words)
    token_result = [w[0] for w in word_tokens]
    return token_result

def check_dictionary(dict_word):
    check_word = enchant.Dict("en_US")
    validated_word = check_word.check(dict_word)
    return validated_word

def split_on_angle_brackets(token_words, file_number):
    para = []
    bracket_stack = 0
    ignored_words_per_file = open("name_"+file_number+"_ignored_words.txt", "wb")
    for word in token_words:
        if bracket_stack:
            if word == 'gt':
                bracket_stack -= 1
            elif word == 'lt':
                bracket_stack += 1
        else:
            if word == 'lt':
                if len(para) >= 7:
                    yield ' '.join(para)
                para = []
                bracket_stack = 1
            elif word != 'amp':
                if check_dictionary(word):
                    para.append(word)
                    #print "append ", word
                else:
                    print "Ignored word: ", word
                    ignored_words_per_file.write(word + " \n")
    if para:
        yield ' '.join(para)
    # close opened files
    ignored_words_per_file.close()

clean_files()