
I have two files, check.txt and orig.txt. I want to check every word in check.txt and see whether it matches any word in orig.txt. If it matches, the code should replace that word with its first match; otherwise it should leave the word as it is. But somehow it's not working as required. Kindly help.

check.txt looks like this:

ukrain

troop

force

and orig.txt looks like:

ukraine cnn should stop pretending & announce: we will not report news while it reflects bad on obama @bostonglobe @crowleycnn @hardball

rt @cbcnews: breaking: .@vice journalist @simonostrovsky, held in #ukraine now free and safe http://t.co/sgxbedktlu http://t.co/jduzlg6jou

russia 'outraged' at deadly shootout in east #ukraine -  moscow:... http://t.co/nqim7uk7zg
 #groundtroops #russianpresidentvladimirputin

http://pastebin.com/XJeDhY3G

f = open('check.txt','r')
orig = open('orig.txt','r')
new = open('newfile.txt','w')

for word in f:
    for line in orig:
        for word2 in line.split(" "):
            word2 = word2.lower()            
            if word in word2:
                word = word2
            else:
                print('not found')
        new.write(word)
Grijesh Chauhan
Nabila Shahid
  • *"If it does match then the code should replace that word with its first match otherwise it should leave the word as it is"* what should be replaced? in original file or in check file? – Grijesh Chauhan Apr 25 '14 at 11:53
  • Note that if you open a file in read mode using 'r', then you can't write to that file. – Grijesh Chauhan Apr 25 '14 at 11:54
  • `for word in f: for line in orig` If the second loop loops over lines, then what would the first loop loop over? – tobias_k Apr 25 '14 at 11:54
  • @GrijeshChauhan i have replaces 'r' with 'w' still its not working – Nabila Shahid Apr 25 '14 at 11:57
  • @user3571809 if you replace `"r"` with `"w"` then new empty file will be created.. you can't read from that file. -- so still wrong. Your question is unclear at first, there are many fundamental mistakes as well, I suggest you read some more stuff first on file and basic Python. – Grijesh Chauhan Apr 25 '14 at 11:58
  • It would help if you provided some example inputs and expected outputs. Now we're just guessing. – msvalkon Apr 25 '14 at 11:59
  • @tobias_k check.txt only has single words on each line but orig.txt has sentences at each line so loops have to be like this – Nabila Shahid Apr 25 '14 at 11:59
  • @GrijeshChauhan now i'm writing in newfile.txt ... still not working – Nabila Shahid Apr 25 '14 at 12:01
  • @user3571809 from your innermost loop it looks like your logic is not clear: you are assigning `word = word2` multiple times. Additionally, either remove the inner for loop and just do `word in line`, or use `word == word2` if you want to check for an exact match. – Grijesh Chauhan Apr 25 '14 at 12:06
  • @GrijeshChauhan i don't want exact matches... i'm trying to replace my stemmed words with the closest word in orig.txt – Nabila Shahid Apr 25 '14 at 12:09
  • @GrijeshChauhan how do i break the loop as after i have assigned word=word2 for the first time... i'm new to python so i'm having difficulty figuring this out – Nabila Shahid Apr 25 '14 at 12:11
  • @user3571809 As I said in my first comment, Please also add your expected output in Question. – Grijesh Chauhan Apr 25 '14 at 12:13
  • hey as per ur example do u want to replace ukrain with ukraine right?? – sundar nataraj Apr 25 '14 at 12:17
  • check this might be useful https://docs.python.org/2/library/difflib.html#difflib.get_close_matches – sundar nataraj Apr 25 '14 at 12:19

1 Answer


There are two problems with your code:

  1. When you loop over the lines in f, each word still has a trailing newline character, so the check `word in word2` never succeeds.
  2. You want to iterate over orig for each of the words from f, but file objects are iterators, so orig is exhausted after the first word from f and the inner loop does nothing for the remaining words.

You can fix those by doing word = word.strip() and orig = list(orig), or you can try something like this:

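# open both files; the with block closes them automatically
with open('check.txt') as f, open('orig.txt') as orig:
    # get all stemmed words, skipping blank lines
    stemmed = [line.strip() for line in f if line.strip()]
    # set of lowercased original words
    original = set(word.lower() for line in orig for word in line.split())
# map stemmed words to unstemmed words
unstemmed = {word: None for word in stemmed}
# find original words for word stems in map
# (a set is unordered, so which match is kept is arbitrary)
for stem in unstemmed:
    for word in original:
        if stem in word:
            unstemmed[stem] = word
print(unstemmed)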

Or shorter (without that final double loop), using difflib, as suggested in the comments:

import difflib
unstemmed = {word: difflib.get_close_matches(word, original, 1) for word in stemmed}
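
Note that difflib.get_close_matches returns a list, which is empty when nothing scores above the cutoff (0.6 by default), so the values in the one-liner above are lists rather than words. A minimal sketch of unwrapping that result and falling back to the stem; the helper name closest_word and the sample candidates are made up for illustration:

```python
import difflib

def closest_word(stem, candidates, cutoff=0.6):
    """Return the best close match for stem, or stem itself if none qualifies."""
    # n=1 asks for the single best match; the result is still a list
    matches = difflib.get_close_matches(stem, candidates, n=1, cutoff=cutoff)
    return matches[0] if matches else stem

# sample words taken from the orig.txt excerpt in the question
candidates = ["ukraine", "#ukraine", "#groundtroops", "russia"]

print(closest_word("ukrain", candidates))  # ukraine
print(closest_word("force", candidates))   # force (no close match, stem kept)
```

With the default cutoff, "troop" is not considered close to "#groundtroops" because the similarity ratio is driven by overall length; lowering the cutoff (for example to 0.5) or using the substring test from the loop version may suit stems better.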

Also, remember to close your files, or open them in a with statement so they are closed automatically.
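
For completeness, here is a rough sketch of the smaller fix mentioned above (strip each stem, read the original words into a list once) that also stops at the first match, which is what the question asks for. The function names replace_stems and rewrite_file are hypothetical, and the file names are the ones from the question:

```python
def replace_stems(stem_lines, orig_lines):
    """Replace each stem with the first original word that contains it."""
    # build the word list once, so it can be scanned again for every stem
    orig_words = [w.lower() for line in orig_lines for w in line.split()]
    result = []
    for raw in stem_lines:
        stem = raw.strip()  # drop the trailing newline
        if not stem:
            continue        # skip blank lines
        # next() stops at the first containing word; the stem is the fallback
        result.append(next((w for w in orig_words if stem in w), stem))
    return result

def rewrite_file(check_path, orig_path, out_path):
    """Read both input files and write one replaced word per line."""
    # the with statement closes all three files, even if an error occurs
    with open(check_path) as f, open(orig_path) as orig, open(out_path, "w") as new:
        for word in replace_stems(f, orig):
            new.write(word + "\n")

# in-memory demonstration using lines from the question
stems = ["ukrain\n", "troop\n", "force\n"]
lines = ["ukraine cnn should stop pretending", "#groundtroops #russianpresidentvladimirputin"]
print(replace_stems(stems, lines))  # ['ukraine', '#groundtroops', 'force']
```

To process the actual files, one would call rewrite_file('check.txt', 'orig.txt', 'newfile.txt').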

tobias_k