
The goal is to (a) print a list of unique words from a text file and (b) find the longest word.

  • I cannot use imports in this challenge.

The file handling and main functionality are what I want; however, the list needs to be cleaned. As you can see from the output, words are getting joined with punctuation, and therefore maxLength is obviously incorrect.

with open("doc.txt") as reader, open("unique.txt", "w") as writer:

    unwanted = "[],."
    unique = set(reader.read().split())
    unique = list(unique) 
    unique.sort(key=len)
    regex = [elem.strip(unwanted).split() for elem in unique]
    writer.write(str(regex))
    reader.close()

    maxLength = len(max(regex,key=len ))
    print(maxLength)
    res = [word for word in regex if len(word) == maxLength]
    print(res)



===========

Sample:

pioneered the integrated placement year concept over 50 years ago [7][8][9] with more than 70 per cent of students taking a placement year, the highest percentage in the UK.[10]

Edison
  • Comments are not for extended discussion; this conversation has been [moved to chat](https://chat.stackoverflow.com/rooms/213738/discussion-on-question-by-tymac-print-a-list-of-unique-words-from-a-text-file-af). – Samuel Liew May 12 '20 at 23:35

3 Answers


Here's a solution that uses str.translate() to blank out all the bad characters (plus newlines) before we ever do the split(). (Normally we'd use a regex with re.sub(), but imports aren't allowed here.) This makes the cleaning a one-liner, which is really neat:

bad = "[],.\n"
bad_transtable = str.maketrans(bad, ' ' * len(bad))

# We can directly read and clean the entire input, without a reader object:
cleaned_input = open('doc.txt').read().translate(bad_transtable)
#with open("doc.txt") as reader:
#    cleaned_input = reader.read().translate(bad_transtable)

# Get list of unique words, in decreasing length
unique_words = sorted(set(cleaned_input.split()), key=lambda w: -len(w))   

with open("unique.txt", "w") as writer:
    for word in unique_words:
        writer.write(f'{word}\n')

max_length = len(unique_words[0])
print([word for word in unique_words if len(word) == max_length])
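
For reference, the re.sub() version mentioned above (off-limits here, since it needs an import) would look roughly like this sketch, assuming the same bad characters:

import re

# One regex pass: any run of [ ] , . or newline becomes a single space
cleaned_input = re.sub(r'[][,.\n]+', ' ', open('doc.txt').read())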

Notes:

  • since the input is already 100% cleaned and split, there's no need to append to a list/insert into a set as we go and then make another cleaning pass later. We can just create unique_words directly! (using set() to keep only uniques). And while we're at it, we might as well use sorted(..., key=lambda w: -len(w)) to sort it in decreasing length. We only sort once, and never iteratively append to lists.
  • hence we guarantee that max_length = len(unique_words[0])
  • this approach will also be more performant than nested loops (for line in <lines>: for word in line.split(): ... with an iterative append() to a word list)
  • no need to call open()/close() explicitly on the reader or writer; that's what the with statement does for you. (It's also more elegant for handling IO when exceptions happen.)
  • you could also merge the printing of the max_length words inside the writer loop. But it's cleaner code to keep them separate.
  • note we use f-string formatting f'{word}\n' to add the newline back when we write() an output line
  • in Python we use lower_case_with_underscores for variable names, hence max_length not maxLength. See PEP8
  • in fact here, we don't strictly need a with-statement for the reader, if all we're going to do is slurp its entire contents in one go with open('doc.txt').read(). (That's not scalable for huge files; you'd have to read in chunks or n lines.)
  • str.maketrans() is a builtin, but if your teacher objects to the module reference, you can also call it on a bound string, e.g. ' '.maketrans() (see the sketch after these notes)
  • str.maketrans() is really a throwback to the days when we only had 95 printable ASCII characters, not Unicode. It still works on Unicode, but building and using huge translation dicts is annoying and memory-hungry; regex on Unicode is easier, since you can define entire character classes.
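
To illustrate the bound-string trick from the notes (a minimal sketch, with the same bad characters assumed):

bad = "[],.\n"
# maketrans is reachable from any str instance, so no explicit str. reference is needed
bad_transtable = ' '.maketrans(bad, ' ' * len(bad))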

Alternative solution if you don't yet know str.translate()

bad = "[],.\n"   # same bad characters as above

dirty_input = open('doc.txt').read()
cleaned_input = dirty_input
# If you can't use either 're.sub()' or 'str.translate()', you have to manually
# str.replace() each bad char one-by-one (or else use a method like str.isalpha())
for bad_char in bad:
    cleaned_input = cleaned_input.replace(bad_char, ' ')

And if you wanted to be ridiculously minimalist, you could write the entire output file in one line with a list comprehension. Don't do this: it would be terrible for debugging, e.g. if you couldn't open/write/overwrite the output file, got an IOError, or unique_words wasn't a list, etc.:

open("unique.txt", "w").writelines([f'{word}\n' for word in unique_words])
smci
  • Thanks. Very nice. We haven't been shown translate yet so I'm wondering if it's doable in a similar fashion using replace? I like the separation of read/writer. – Edison May 12 '20 at 18:37
  • tymac: added, at bottom. Yes, `str.translate()` is pretty powerful, so are regexes, working around them is quite annoying... – smci May 12 '20 at 18:59
  • I noticed at the end of the final list there are numbers left over. They came from the footnote symbols `[3][5][54][2]` etc. How would I remove those since we are not using `imports`? – Edison May 13 '20 at 01:22
  • How would I add a sort to sort alphabetically? I tried `sorted_unique_words = sorted(unique_words)` which worked by itself but not when it was combined with your split() line. – Edison May 13 '20 at 13:33
  • What do you mean? `sorted(set(cleaned_input.split()))` does sort unique words into alphanumeric order, so numbers, then upper-case, then lower-case. Do you want case-insensitive? Numbers at the end? etc. All of those have existing duplicate questions, please search through them or else post a new question. – smci May 13 '20 at 23:21

Here is a solution. The trick is to use the Python str method .isalpha() to keep only alphabetic characters (everything else is filtered out).

with open("unique.txt", "w") as writer:
    with open("doc.txt") as reader:
        cleaned_words = []
        for line in reader.readlines():
            for word in line.split():
                cleaned_word = ''.join([c for c in word if c.isalpha()])
                if len(cleaned_word):
                    cleaned_words.append(cleaned_word)

        # print unique words
        unique_words = set(cleaned_words)
        print(unique_words)

        # write words to file? depends what you need here
        for word in unique_words:
            writer.write(str(word))
            writer.write('\n')

        # print length of longest
        print(len(sorted(unique_words, key=len, reverse=True)[0]))
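
Note that .isalpha() drops digits as well as punctuation. If you actually wanted to keep alphanumerics (letters and digits), the builtin str.isalnum() would be the drop-in swap for the cleaning step, e.g.:

cleaned_word = ''.join([c for c in word if c.isalnum()])
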
JoshuaBox
  • Thanks. p.s. Curious, is there another way without using `isalpha`? – Edison May 12 '20 at 17:47
  • Could you include the writer in your answer as well Josh? You were mentioning aesthetics. Thank you. – Edison May 12 '20 at 17:54
  • The OP's code also prints all (unique) words of longest length, not just their actual length. – smci May 12 '20 at 19:05
  • I have added back in the writer (I think @smci's example which only has the code to read and write within the relevant context is even cleaner). please adapt to how you'd like your output formatted. Also smci's solution below uses `.translate()` instead of `.isalpha()` – JoshuaBox May 12 '20 at 20:21

Here is another solution, without using any special string methods.

bad = '`~@#$%^&*()-_=+[]{}\\|;\':".>?<,/?'

a = open('doc.txt').read()   # the text to clean

clean = ''
for i in a:
    if i not in bad:
        clean += i
    else:
        clean += ' '

cleans = [i for i in clean.split(' ') if len(i)]   # drop the empty strings

clean_uniq = list(set(cleans))   # keep unique words only

clean_uniq.sort(key=len)   # shortest first, longest last

print(clean_uniq)
print(len(clean_uniq[-1]))   # length of the longest word
Abhishek Verma