Remove non-alphabet (preferably using lambda func or something else short but not for-loop)

Question

I wrote the following code, which prints how often each character appears in a file named 'Test'.

from os import strerror
from collections import Counter

try:
    with open('Test', 'rt') as handle:
        content = handle.read().lower().replace(' ', '').replace('\n', '')
        counts = Counter(content)
    for i in sorted(counts, key=lambda x: counts[x], reverse=True)[:30]:
        print('{} -> {}'.format(i, counts[i]))
    
except IOError as e:
    print('I/O error occurred: ', strerror(e.errno))

The output is:

e -> 383
o -> 247
s -> 226
t -> 224
n -> 219
a -> 217
r -> 201
i -> 188
d -> 127
h -> 125
l -> 112
c -> 112
m -> 105
u -> 72
f -> 59
p -> 59
g -> 58
y -> 48
b -> 47
. -> 36
w -> 35
, -> 35
v -> 28
k -> 25
0 -> 15
- -> 9
% -> 8
1 -> 7
’ -> 7
x -> 7

Afterward I realized I only need the alphabetic characters. I figured I have to modify line #6:

content = handle.read().lower().replace(' ', '').replace('\n', '')

I am aware I could just write a for-loop with the conditional expression str.isalpha() to remove the non-alphabetic characters.

I wonder whether there are better ways to do that.

Thank you in advance for your feedback:-)

score 2 · Accepted Answer · answered Dec 24 '20 at 09:08

You can do it all in one go, using a generator expression or filter:

counts = Counter(filter(str.isalpha, handle.read().lower()))
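The generator-expression form mentioned above is not spelled out, so here is a minimal sketch of how it might look next to the filter version. The sample string is made up and stands in for handle.read() so the snippet runs on its own.

```python
from collections import Counter

# Hypothetical sample text standing in for handle.read()
text = "Hello, World! 123"

# Generator expression: keep only the characters for which isalpha() is true
counts_gen = Counter(ch for ch in text.lower() if ch.isalpha())

# Equivalent filter() form, as used in the answer
counts_filter = Counter(filter(str.isalpha, text.lower()))
```

Both forms feed the characters to Counter lazily, so no intermediate cleaned-up string is built.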

Btw, you should also consider using Counter.most_common for your output:

for k, n in counts.most_common(30):
    print('{} -> {}'.format(k, n))
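Putting the two suggestions together, the counting part could be wrapped roughly as follows. This is only a sketch: the function name count_letters and its top parameter are made up for illustration and are not part of the original code.

```python
from collections import Counter


def count_letters(path, top=30):
    """Return the `top` most frequent alphabetic characters in a file.

    Hypothetical helper: the name and signature are assumptions.
    """
    with open(path, 'rt') as handle:
        # filter(str.isalpha, ...) drops spaces, newlines, digits and punctuation
        counts = Counter(filter(str.isalpha, handle.read().lower()))
    # most_common() returns (character, count) pairs, highest count first
    return counts.most_common(top)
```

Calling count_letters('Test') inside the original try/except block would then give the pairs to print.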

score 1 · Answer 2 · answered Dec 24 '20 at 09:16

You can replace this line:

content = handle.read().lower().replace(' ', '').replace('\n', '')

with this regex one-liner:

import re
content = re.sub("[^a-z]+", "", handle.read().lower())

This removes spaces, newlines, and every other character outside a-z in a single pass.
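One thing to keep in mind: a character class such as [^a-z] keeps only ASCII letters, while str.isalpha() also accepts non-ASCII letters such as é. A small sketch of the difference, using a made-up sample string:

```python
import re

# Hypothetical sample text; \u00e9 is the letter "é"
text = "Caf\u00e9-au-lait: 50% off!"

# Regex approach: everything outside a-z is removed, including "é"
ascii_only = re.sub("[^a-z]+", "", text.lower())

# str.isalpha approach: "é" counts as a letter and is kept
unicode_letters = "".join(filter(str.isalpha, text.lower()))

print(ascii_only)
print(unicode_letters)
```

Which behaviour is preferable depends on whether the file is expected to contain only English text.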
