7

I have a long text file (a screenplay). I want to turn this text file into a list (where every word is separated) so that I can search through it later on.

The code I have at the moment is:

file = open('screenplay.txt', 'r')
words = list(file.read().split())
print words

I think this works to split up all the words into a list. However, I'm having trouble removing all the extra stuff like commas and periods at the end of words. I also want to make capital letters lowercase (because I want to be able to search in lowercase and have both capitalized and lowercase words show up). Any help would be fantastic :)

Tom F

7 Answers

7

This is a job for regular expressions!

For example:

import re
with open('screenplay.txt', 'r') as f:
    # .lower() returns a copy with all uppercase characters converted to lowercase.
    text = f.read().lower()
# Replace every run of characters that is not a lowercase letter, a space, or an apostrophe with a single space:
text = re.sub(r"[^a-z ']+", " ", text)
words = text.split()
print words
Brionius
  • Glad to help. You may need to adjust the regex pattern based on what types of punctuation you do or do not want to survive in the final `words` list. – Brionius Aug 08 '13 at 21:20
4

A screenplay should be short enough to be read into memory in one fell swoop. If so, you could then remove all punctuation using the translate method. Finally, you can produce your list simply by splitting on whitespace using str.split:

import string

with open('screenplay.txt', 'rb') as f:
    content = f.read()
    content = content.translate(None, string.punctuation).lower()
    words = content.split()

print words

Note that this will change Mr.Smith into mrsmith. If you'd like it to become ['mr', 'smith'] then you could replace all punctuation with spaces, and then use str.split:

def using_translate(content):
    table = string.maketrans(
        string.punctuation,
        ' '*len(string.punctuation))
    content = content.translate(table).lower()
    words = content.split()
    return words
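
The two snippets above rely on Python 2 APIs (`str.translate(None, ...)` and `string.maketrans`). As a rough sketch of the same idea in Python 3, assuming the file has already been read into a text string, `str.maketrans` can build both tables. The function names and the sample sentence here are illustrative only:

```python
import string

# Table that deletes every ASCII punctuation character.
DELETE_PUNCT = str.maketrans('', '', string.punctuation)
# Table that maps every ASCII punctuation character to a space.
PUNCT_TO_SPACE = str.maketrans(string.punctuation, ' ' * len(string.punctuation))

def words_deleting_punctuation(content):
    """Delete punctuation, lowercase, then split on whitespace."""
    return content.translate(DELETE_PUNCT).lower().split()

def words_spacing_punctuation(content):
    """Replace punctuation with spaces, lowercase, then split on whitespace."""
    return content.translate(PUNCT_TO_SPACE).lower().split()

sample = "Hello, Mr.Smith!"  # made-up sample text
print(words_deleting_punctuation(sample))  # ['hello', 'mrsmith']
print(words_spacing_punctuation(sample))   # ['hello', 'mr', 'smith']
```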

One problem you might encounter using a positive regex pattern such as [a-z]+ is that it will only match ASCII letters. If the file has accented characters, the words would get split apart. For example, Gruyère (once lowercased) would become ['gruy', 're'].
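
A small sketch of that failure, plus one possible Unicode-aware pattern. It assumes Python 3, where `str` patterns treat `\w` as Unicode-aware by default; the sample string is made up:

```python
import re

text = "Gruy\u00e8re, please!".lower()  # "gruyère, please!"

# ASCII-only pattern: the accented letter is not matched, so the word is cut in two.
ascii_words = re.findall(r"[a-z]+", text)
print(ascii_words)  # ['gruy', 're', 'please']

# [^\W\d_] matches Unicode word characters except decimal digits and the
# underscore, which is roughly "any letter".
unicode_words = re.findall(r"[^\W\d_]+", text)
print(unicode_words)  # ['gruyère', 'please']
```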

You could fix that by using re.split to split on punctuation. For example,

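import re
def using_re(content):
    # re.escape stops ], \ and - from being misread inside the character class.
    pattern = r"[\s%s]+" % re.escape(string.punctuation)
    # Filter out the empty strings re.split yields at the text's edges.
    return [w for w in re.split(pattern, content.lower()) if w]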

However, using str.translate is faster:

In [72]: %timeit using_re(content)
100000 loops, best of 3: 9.97 us per loop

In [73]: %timeit using_translate(content)
100000 loops, best of 3: 3.05 us per loop
unutbu
1

Use the replace method.

mystring = mystring.replace(",", "")
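
Since `replace` handles one substring per call, stripping several punctuation marks means calling it in a loop. A minimal sketch, with an arbitrary choice of characters and a made-up sample string:

```python
mystring = "Wait, what? No. Way!"  # made-up sample text

# Remove each unwanted character in turn; this set of characters is an arbitrary choice.
for ch in ",.!?":
    mystring = mystring.replace(ch, "")

words = mystring.lower().split()
print(words)  # ['wait', 'what', 'no', 'way']
```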

If you want a more elegant solution that you will use many times over, read up on regular expressions. Most languages support them, and they are extremely useful for more complicated replacements.

Brian H
0

You can use a simple regexp to create a set of all the words (sequences of one or more alphabetic characters):

import re
with open('screenplay.txt') as f:
    words = set(re.findall("[a-z]+", f.read().lower()))

Using a set each word will be included just once.

Just using findall will instead give you all the words in order.
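
To make the difference concrete, here is a short sketch on a made-up line of dialogue (the string stands in for the file contents):

```python
import re

line = "The cat saw the other cat."  # stand-in for f.read()

ordered = re.findall("[a-z]+", line.lower())  # every word, in order, duplicates kept
unique = set(ordered)                         # each distinct word once, unordered

print(ordered)               # ['the', 'cat', 'saw', 'the', 'other', 'cat']
print(sorted(unique))        # ['cat', 'other', 'saw', 'the']
print('cat' in unique)       # True
print(ordered.count('cat'))  # 2
```

The set is convenient for "does this word appear at all?" checks, while the list keeps positions and repeat counts.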

6502
0

You could use a dictionary to specify what characters you don't want, and format the current string based on your choices.

replaceChars = {'.':'',',':'', ' ':''}
print reduce(lambda x, y: x.replace(y, replaceChars[y]), replaceChars, "ABC3.2,1,\nCda1,2,3....".lower())

Output:

abc321
cda123
0

You can try something like this. The regexp will probably need some work, though.

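import re
with open('screenplay.txt') as f:
    text = f.read()
# Strip common punctuation from each whitespace-separated token, then lowercase it.
words = map(lambda x: re.sub("[,.!?]", "", x).lower(), text.split())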
MatLecu
0

I have tried this code and it works in my case:

from string import punctuation
words = []
with open("path of your file", "r") as myfile:
    for word in myfile.read().split():
        # Strip punctuation from both ends; skip tokens that were only punctuation.
        word = word.strip(punctuation).lower()
        if word:
            words.append(word)
print(words)