A screenplay should be short enough to be read into memory in one fell swoop. If so, you can then remove all punctuation using the translate method. Finally, you can produce your list simply by splitting on whitespace using str.split:
import string

with open('screenplay.txt', 'rb') as f:
    content = f.read()
# Python 2: translate(None, chars) deletes every character in chars
content = content.translate(None, string.punctuation).lower()
words = content.split()
print words
Note that this will change Mr.Smith into mrsmith. If you'd like it to become ['mr', 'smith'], you could replace all punctuation with spaces and then use str.split:
def using_translate(content):
    table = string.maketrans(
        string.punctuation,
        ' '*len(string.punctuation))
    content = content.translate(table).lower()
    words = content.split()
    return words
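The snippets above rely on Python 2 behaviour (string.maketrans and translate(None, ...)). As a rough sketch of the same idea on Python 3, assuming the text has already been decoded to str, str.maketrans can build both kinds of table. The helper names strip_punctuation and split_on_punctuation are made up for this illustration:

```python
import string

# Table that deletes every punctuation character (third argument = chars to drop).
DELETE_TABLE = str.maketrans('', '', string.punctuation)
# Table that maps every punctuation character to a space.
SPACE_TABLE = str.maketrans(string.punctuation, ' ' * len(string.punctuation))

def strip_punctuation(content):
    """Hypothetical helper: drop punctuation, lowercase, split on whitespace."""
    return content.translate(DELETE_TABLE).lower().split()

def split_on_punctuation(content):
    """Hypothetical helper: turn punctuation into spaces, lowercase, split."""
    return content.translate(SPACE_TABLE).lower().split()

print(strip_punctuation("Mr.Smith went home."))     # ['mrsmith', 'went', 'home']
print(split_on_punctuation("Mr.Smith went home."))  # ['mr', 'smith', 'went', 'home']
```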
One problem you might encounter using a positive regex pattern such as [a-z]+ is that it will only match ASCII characters. If the file has accented characters, the words would get split apart: Gruyère would become ['Gruy', 're'].
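A quick way to see the problem, as a small Python 3 sketch. The sample string is only an illustration, and the pattern is widened to [a-zA-Z]+ here so that the capital G is matched as well:

```python
import re

sample = "Gruyère"  # illustrative input, not taken from any real screenplay

# An ASCII-only character class stops at the accented letter.
ascii_words = re.findall(r"[a-zA-Z]+", sample)
print(ascii_words)  # ['Gruy', 're']
```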
You could fix that by using re.split to split on punctuation. For example:
import re
def using_re(content):
    # re.escape keeps ], \ and ^ literal inside the character class
    words = re.split(r"[ %s\t\n]+" % re.escape(string.punctuation), content.lower())
    return words
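To illustrate the behaviour, here is a Python 3 sketch of the same split. It applies re.escape to the punctuation (so that characters such as ] and \ stay literal inside the character class) and uses a hypothetical name, split_words. Because re.split returns an empty string when the text starts or ends with a delimiter, the sketch filters those out:

```python
import re
import string

# Character class of space, tab, newline and every (escaped) punctuation mark.
DELIMITERS = re.compile(r"[ \t\n%s]+" % re.escape(string.punctuation))

def split_words(content):
    """Hypothetical helper: split on punctuation/whitespace, drop empty strings."""
    return [w for w in DELIMITERS.split(content.lower()) if w]

print(split_words("Mr.Smith ate Gruyère."))  # ['mr', 'smith', 'ate', 'gruyère']
```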
However, using str.translate is faster:
In [72]: %timeit using_re(content)
100000 loops, best of 3: 9.97 us per loop
In [73]: %timeit using_translate(content)
100000 loops, best of 3: 3.05 us per loop
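Timings like these depend on the machine, the Python version, and the input, so it is worth re-measuring. A minimal Python 3 sketch using the standard timeit module follows. The sample text and the Python 3 ports of the two functions are assumptions made for illustration, not the setup used for the numbers above:

```python
import re
import string
import timeit

SPACE_TABLE = str.maketrans(string.punctuation, ' ' * len(string.punctuation))
DELIMITERS = re.compile(r"[ \t\n%s]+" % re.escape(string.punctuation))

def using_translate(content):
    # Python 3 port: map punctuation to spaces, then split on whitespace.
    return content.translate(SPACE_TABLE).lower().split()

def using_re(content):
    # Python 3 port: split on runs of punctuation and whitespace.
    return DELIMITERS.split(content.lower())

content = "Mr.Smith ate Gruyère, then left " * 100  # made-up sample text

for func in (using_translate, using_re):
    seconds = timeit.timeit(lambda: func(content), number=1000)
    print("%s: %.6f s for 1000 calls" % (func.__name__, seconds))
```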