Here's a solution that uses `str.translate()` to throw away all bad characters (plus the newline) before we ever do the `split()`. (Normally we'd use a regex with `re.sub()`, but you're not allowed.) This makes the cleaning a one-liner, which is really neat:
```python
bad = "[],.\n"
bad_transtable = str.maketrans(bad, ' ' * len(bad))

# We can directly read and clean the entire input, without a reader object:
cleaned_input = open('doc.txt').read().translate(bad_transtable)
# with open("doc.txt") as reader:
#     cleaned_input = reader.read().translate(bad_transtable)

# Get the list of unique words, in decreasing order of length
unique_words = sorted(set(cleaned_input.split()), key=lambda w: -len(w))

with open("unique.txt", "w") as writer:
    for word in unique_words:
        writer.write(f'{word}\n')

max_length = len(unique_words[0])
print([word for word in unique_words if len(word) == max_length])
```
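To sanity-check the cleaning-and-sorting steps on an in-memory string (the sample text here is made up, not from the real `doc.txt`), you can run the same pipeline without touching the filesystem:

```python
bad = "[],.\n"
bad_transtable = str.maketrans(bad, ' ' * len(bad))

sample = "alpha, [beta].\ngamma alpha\n"  # hypothetical input
cleaned = sample.translate(bad_transtable)

# Duplicates removed, longest words first ('alpha'/'gamma' before 'beta'):
unique_words = sorted(set(cleaned.split()), key=lambda w: -len(w))
print(unique_words)
```

Note the order of equal-length words depends on `set()` iteration order, so only the by-length ordering is guaranteed here.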
Notes:
- Since the input is already 100% cleaned and split, there's no need to append to a list / insert into a set as we go and then make another cleaning pass later. We can create `unique_words` directly (using `set()` to keep only the uniques). And while we're at it, we might as well use `sorted(..., key=lambda w: -len(w))` to sort it in decreasing length. We only need to sort once, and there's no iterative appending to lists.
- Hence we're guaranteed that `unique_words[0]` is one of the longest words, so `max_length = len(unique_words[0])` is safe.
- This approach is also going to perform better than the nested loops `for line in <lines>: for word in line.split(): ...` with an iterative `append()` to a word list.
- No need to do explicit `.open()`/`.close()` on the reader/writer; that's what the `with` statement does for you. (It's also more elegant for handling I/O when exceptions happen.)
- You could also merge the printing of the max-length words into the writer loop, but it's cleaner code to keep them separate.
- Note we use f-string formatting `f'{word}\n'` to add the newline back when we `write()` each output line.
- In Python we use lower_case_with_underscores for variable names, hence `max_length`, not `maxLength`. See PEP 8.
- In fact here, we don't strictly need a with-statement for the reader, if all we're going to do is slurp its entire contents in one go with `open('doc.txt').read()`. (That's not scalable for huge files; you'd have to read in chunks or n lines at a time.)
- `str.maketrans()` is a built-in static method on the `str` class, but if your teacher objects to referencing the class directly, you can also call it on a bound string, e.g. `' '.maketrans(...)`.
- `str.maketrans()` is really a throwback to the days when we only had 95 printable ASCII characters, not Unicode. It still works on Unicode, but building and using huge translation dicts is annoying and uses memory; regex on Unicode is easier, since you can define entire character classes.
Alternative solution if you don't yet know `str.translate()`:
```python
dirty_input = open('doc.txt').read()

# If you can't use either 're.sub()' or 'str.translate()', you have to manually
# str.replace() each bad char one-by-one (or else use a method like str.isalpha())
cleaned_input = dirty_input
for bad_char in bad:
    cleaned_input = cleaned_input.replace(bad_char, ' ')
```
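As a quick check that the two approaches are equivalent, the replace loop produces exactly the same cleaned string as the translate table (again on a made-up in-memory sample):

```python
bad = "[],.\n"
sample = "one, [two].\nthree"  # hypothetical input

# Clean via the replace loop:
by_replace = sample
for bad_char in bad:
    by_replace = by_replace.replace(bad_char, ' ')

# Clean via the translation table:
by_translate = sample.translate(str.maketrans(bad, ' ' * len(bad)))

assert by_replace == by_translate
print(by_replace)
```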
And if you wanted to be ridiculously minimalist, you could write the entire output file in one line with a list comprehension. Don't do this: it would be terrible for debugging, e.g. if you couldn't open/write/overwrite the output file, or got an `IOError`, or `unique_words` wasn't a list, etc.:
```python
open("unique.txt", "w").writelines([f'{word}\n' for word in unique_words])
```