My code is behaving strangely, and I suspect it has to do with the regular expressions I'm using.
I'm trying to determine the number of total words, number of unique words, and number of sentences in a text file.
Here is my code:
    import sys
    import re

    file = open('sample.txt', 'r')

    def word_count(file):
        words = []
        reg_ex = r"[A-Za-z0-9']+"
        p = re.compile(reg_ex)
        for l in file:
            for i in p.findall(l):
                words.append(i)
        return len(words), len(set(words))

    def sentence_count(file):
        sentences = []
        reg_ex = r'[a-zA-Z0-9][.!?]'
        p = re.compile(reg_ex)
        for l in file:
            for i in p.findall(l):
                sentences.append(i)
        return sentences, len(sentences)

    sentence, sentence_count = sentence_count(file)
    word_count, unique_word_count = word_count(file)

    print('Total word count: {}\n'.format(word_count) +
          'Unique words: {}\n'.format(unique_word_count) +
          'Sentences: {}'.format(sentence_count))
The output is the following:
Total word count: 0
Unique words: 0
Sentences: 5
What is really strange is that if I comment out the sentence_count() function, the word_count() function starts working and outputs the correct numbers.
Why is this inconsistency happening? If I comment out either function, the other outputs the correct value; with both in place, whichever runs second outputs 0s. Can someone help me get both functions working?
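For what it's worth, here is a minimal sketch that seems to reproduce the same symptom without any regular expressions. It assumes an in-memory `io.StringIO` standing in for `sample.txt`; the sample text and the `count_lines` helper are made up for illustration. The second pass over the same file object sees no lines unless the file is rewound with `seek(0)` first:

```python
import io

def count_lines(f):
    # Hypothetical helper: iterate the file object once and count its lines.
    return sum(1 for _ in f)

# Assumption: StringIO stands in for the real file opened from disk.
f = io.StringIO("One. Two!\nThree?\n")

first = count_lines(f)    # reads every line, leaving the position at the end
second = count_lines(f)   # nothing left to read on a second pass
f.seek(0)                 # rewind to the start of the file
third = count_lines(f)    # lines are available again

print(first, second, third)  # prints: 2 0 2
```

So the regular expressions may not be the culprit at all; the order in which the two functions read from the same open file looks like what matters.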