Using input and output files in Python to determine repeated words

Question

I hope I am clear. I am trying to create a Python program that goes through the first file and determines which words are repeated. To determine whether words are repeated, the contents of the file must be stripped of punctuation and converted to lower case. The program then writes the repeated words to the second text file. Each repeated word is to be written only once in the second file.

Below, I've made an attempt and I ran into two errors.

Error one: I've noticed that the punctuation strip function that was created does not fully remove all the punctuation.

Error two: The repeated words are written to the second file as many times as they appear in the original. I attempted to use a `break` statement if the word already existed in the output, but the check somehow never triggers.

Below is my code.

def repeatWords(filename_1, filename_2):
    infile_1=open(filename_1,'r')
    content_1=infile_1.read()
    infile_1.close()
    import string
    content_1=content_1.strip(string.punctuation) # this did not remove all punctuations
    content_1=content_1.lower()
    content_1=content_1.split()


    outfile=open(filename_2,'w')
    outfile.write('') #used to create second file, assuming it does not exist
    outfile.close()

    outfile=open(filename_2,'r+')
    write_content=outfile.read()

    for word in content_1:
        write_content=outfile.read()
        if content_1.count(word)>1:
            if word in write_content:
                break
            else:
                outfile.write(word)
                outfile.write('\n')
    outfile.close()
    # after this is executed, the words repeat as many times as they appear

    infile_2=open(filename_2,'r')
    content_2=infile_2.read()
    infile_2.close()
    return content_2


inF = 'catInTheHat.txt'
outF = 'catRepWords.txt'
print(repeatWords(inF, outF))

The contents of the first file are:

Too wet to go out and too cold to play ball.
So we sat in the house.
We did nothing at all.
So all we could do was to Sit! Sit! Sit! Sit!

Screenshot link --> http://oi59.tinypic.com/hrln3r.jpg

The reason your punctuation stripping fails is because `strip` only removes punctuation at the start and end of a string. — iobender, Oct 26 '15 at 05:38
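To illustrate the comment above, here is a small sketch of how `str.strip` treats punctuation. The sample string is the last line of the question's input file; the variable names are made up for this example.

```python
import string

line = "So all we could do was to Sit! Sit! Sit! Sit!"

# strip() only trims characters from the two ends of the string,
# so the "!" marks between the words are left in place.
stripped = line.strip(string.punctuation)
print(stripped)  # So all we could do was to Sit! Sit! Sit! Sit

# Stripping each word separately removes punctuation at word edges.
edge_stripped = [word.strip(string.punctuation) for word in line.split()]
print(edge_stripped)
```

Per-word stripping still leaves punctuation inside a word (for example the apostrophe in "don't"), which may or may not be what you want.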

score 1 · Answer 1 · answered Oct 26 '15 at 04:53

I believe the code below does what you need. The line that starts with "words" removes punctuation and breaks the string into a list of words. I then use two sets to keep track of words that appear more than once.

import string

inFile = r'C:\Users\user\Desktop\in.txt'
outFile = r'C:\Users\user\Desktop\out.txt'

with open(inFile,'r') as f:
    inStr = f.read()

exclude = set(string.punctuation)
words = ''.join(ch for ch in inStr if ch not in exclude).lower().split()

alreadySeen = set()
multiples = set()
for word in words:
    if word in alreadySeen:
        multiples.add(word)
    else:
        alreadySeen.add(word)

with open(outFile,'w') as f:
    f.write('\n'.join(multiples))
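The same two-set idea can be tried without any files by wrapping it in a function. This is only a sketch: `find_repeated` and `sample` are names invented here, and the sample text is copied from the question.

```python
import string

def find_repeated(text):
    """Return the set of words that occur more than once in text."""
    exclude = set(string.punctuation)
    cleaned = "".join(ch for ch in text if ch not in exclude)
    seen = set()
    repeated = set()
    for word in cleaned.lower().split():
        if word in seen:
            repeated.add(word)   # second or later occurrence
        else:
            seen.add(word)       # first occurrence
    return repeated

sample = (
    "Too wet to go out and too cold to play ball.\n"
    "So we sat in the house.\n"
    "We did nothing at all.\n"
    "So all we could do was to Sit! Sit! Sit! Sit!\n"
)
print(sorted(find_repeated(sample)))  # ['all', 'sit', 'so', 'to', 'too', 'we']
```

Sets have no defined order, so the words may land in the output file in any order; sorting them, as in the print call here, is one way to make the result predictable.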

score 0 · Answer 2 · edited May 23 '17 at 12:14

Counters are your best friend. You can pass one a list and it will count the elements for you :)
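As a quick illustration of what a `Counter` gives you, here is a minimal sketch on a short, already-cleaned word list (the list itself is made up from the question's last line):

```python
from collections import Counter

words = "so all we could do was to sit sit sit sit".split()

counts = Counter(words)      # maps each word to its number of occurrences
print(counts["sit"])         # 4

# Keep only the words that were counted more than once.
repeated = [word for word, n in counts.items() if n > 1]
print(repeated)              # ['sit']
```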

You can solve the punctuation problem by using the string's very efficient translate function: Best way to strip punctuation from a string in Python
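For reference, a minimal sketch of the translate approach, assuming Python 3 (where the table is built with `str.maketrans`); `strip_punctuation` is a helper name invented for this example:

```python
import string

def strip_punctuation(text):
    """Remove every character listed in string.punctuation from text."""
    # The third argument of str.maketrans lists characters to delete.
    table = str.maketrans("", "", string.punctuation)
    return text.translate(table)

print(strip_punctuation("Sit! Sit! (don't)"))  # Sit Sit dont
```

Unlike `strip`, this removes punctuation anywhere in the string, not only at the ends.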

Also, you can use the `w+` file open mode to create the file and open it for writing in one step, instead of writing an empty string, closing the file and reopening it as you did above.
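The sketch below shows how `w+` behaves and, as a side effect, a likely reason the "already written" check in the question never fires: `read()` starts at the current file position, which sits at the end of the file right after a write. The function name and the temporary file are assumptions made for this demonstration.

```python
import os
import tempfile

def write_then_read(path):
    """Write one word with mode 'w+', then read before and after seek(0)."""
    with open(path, "w+") as f:   # 'w+' creates (or truncates) the file
        f.write("sit\n")
        at_end = f.read()         # position is at end of file: nothing to read
        f.seek(0)                 # rewind to the start
        from_start = f.read()     # now the written text comes back
    return at_end, from_start

# Try it on a throwaway file in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    at_end, from_start = write_then_read(os.path.join(tmp, "out.txt"))
    print(repr(at_end), repr(from_start))
```

Keeping the already-written words in a list or set in memory avoids having to re-read the output file at all.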

import collections

def repeatWords(filename_1, filename_2):
    infile_1=open(filename_1,'r')
    content_1=infile_1.read()
    infile_1.close()
    import string
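    # Python 3 form; in Python 2 this was content_1.translate(string.maketrans("",""), string.punctuation)
    content_1 = content_1.translate(str.maketrans("", "", string.punctuation))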
    content_1=content_1.lower()
    content_1=content_1.split()
    content_2 = []

    c = collections.Counter(content_1)

    with open(filename_2,'w+') as outfile:
        for word in c:
            if(c[word] > 1):
                outfile.write(word + "\n")
                content_2.append(word)

    return content_2


inF = 'in.txt'
outF = 'out.txt'
print(repeatWords(inF, outF))

Output:

>>>python so.py
['all', 'we', 'sit', 'to', 'too', 'so']

out.txt:

all
we
sit
to
too
so

flamenco · Answer 3 · 2015-10-26T05:38:24.380

Use re to remove punctuation from a string.
Use collections to find duplicates.
Use a with statement when opening a file; it makes sure the file is closed for you.

Try this:

import re
import collections

with open("catInTheHat.txt", "r")  as f:
    data = [re.sub(r"['_,!\-\"\\\/}{?\.]", '', item) for item in f.read().replace('\n', ' ').split(" ")]
    print(data)

duplicates = [re.sub(r'[^\w\s]','',item) for item, count in collections.Counter(data).items() if count > 1]

with open("catRepWords.txt", "w") as ff:
    for word in duplicates:
        ff.write(word + '\n')

Output:

all
Sit
to
we
So

If you want to treat Too and too (and similar) as the same word, change data to:

data = [re.sub(r"['_,!\-\"\\\/}{?\.]", '', item).lower() for item in f.read().replace('\n', ' ').split(" ")]
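A shorter variant of the same regex idea, offered only as a sketch: apply one substitution to the whole text and let `split()` with no argument handle spaces and newlines alike. The sample text is the first two lines of the question's file.

```python
import re

text = "Too wet to go out and too cold to play ball.\nSo we sat in the house."

# [^\w\s] matches anything that is neither a word character nor whitespace.
# Note that \w includes the underscore, so underscores are kept.
words = re.sub(r"[^\w\s]", "", text).lower().split()
print(words)
```

Because `split()` without arguments splits on any run of whitespace, the separate `replace('\n', ' ')` step is not needed in this form.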

score 0 · Accepted Answer · answered Oct 26 '15 at 15:53

Thank you all for the input. I was able to fix my errors. I now iterate through each character to strip the punctuation, then build one list of the repeated words and a second list that keeps a single occurrence of each. The answers I received seem to work, but I have not yet been introduced to higher-order programming.

Here is the modified code:

def repeatWords(filename_1, filename_2):
    infile_1=open(filename_1,'r')
    content_1=infile_1.read()
    infile_1.close()

    import string
    new_content=""
    for char in content_1:
        new_content+=char.strip(string.punctuation)

    new_content=new_content.lower()
    new_content=new_content.split()

    repeated_list=[]

    for word in new_content:
        if new_content.count(word)>1:
            repeated_list.append(word)

    new_repeat_list=[]

    for item in repeated_list:
        if item not in new_repeat_list:
            new_repeat_list.append(item)

    outfile=open(filename_2,'w')
    for repeat in new_repeat_list:
        outfile.write(repeat)
        outfile.write('\n')
    outfile.close()

    infile_2=open(filename_2,'r')
    content_2=infile_2.read()
    infile_2.close()
    return content_2

inF = 'catInTheHat.txt'
outF = 'catRepWords.txt'
print(repeatWords(inF, outF))
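If it helps, the two list-building loops can also be folded into a single pass that uses only lists and `if` statements and avoids calling `count` for every word. This is a sketch, not a replacement for the code above; `repeated_in_order` is a made-up name, and it returns words in the order in which each one is first seen to repeat.

```python
def repeated_in_order(words):
    """Return each word that occurs more than once, listed once."""
    seen = []       # every distinct word met so far
    repeated = []   # words met at least twice
    for word in words:
        if word in seen:
            if word not in repeated:
                repeated.append(word)
        else:
            seen.append(word)
    return repeated

print(repeated_in_order(["a", "b", "a", "c", "b", "a"]))  # ['a', 'b']
```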
