-2

I am trying to remove punctuation and lowercase a long string (taken from a text file).

I have an example text file, like so:

This. this, Is, is. An; an, Example. example! Sentence? sentence.

I then have the following script:

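import string

import numpy
from nltk.tokenize import word_tokenize  # assumed: word_tokenize comes from NLTK

def get_input(filepath):
    # Use a context manager so the file is closed after reading
    with open(filepath, 'r') as f:
        return f.read()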

def normalize_text(file):
    all_words = word_tokenize(file)
    for word in all_words:
        word = word.lower()
        word = word.translate(str.maketrans('','',string.punctuation))

    return all_words

def get_collection_size(mydict):
    total = sum(mydict.values())
    return total

def get_vocabulary_size(mylist):
    unique_list = numpy.unique(mylist)
    vocabulary_size = len(unique_list)
    return vocabulary_size

myfile = get_input('D:\\PythonHelp\\example.txt')

total_words = normalize_text(myfile)
mydict = countElement(total_words)
print(total_words)
print(mydict)
print("Collection Size: {}".format(get_collection_size(mydict)))
print("Vocabulary Size: {}".format(get_vocabulary_size(total_words)))

And I get results like the following:

['This', '.', 'this', ',', 'Is', ',', 'is', '.', 'An', ';', 'an', ',', 'Example', '.', 'example', '!', 'Sentence', '?', 'sentence', '.']
{'This': 1, '.': 4, 'this': 1, ',': 3, 'Is': 1, 'is': 1, 'An': 1, ';': 1, 'an': 1, 'Example': 1, 'example': 1, '!': 1, 'Sentence': 1, '?': 1, 'sentence': 1}
Collection Size: 20
Vocabulary Size: 15

However, I would be expecting:

['this', 'is', 'an', 'example', 'sentence']
{'this': 2, 'is': 2, 'an': 2, 'example': 2, 'sentence': 2}
Collection Size: 10
Vocabulary Size: 5

Why is normalize_text(file), which uses str.maketrans and .lower(), not working properly?

When I run python --version I get 3.7.0

artemis
  • I'm not sure that's a relevant duplicate. The OP *is* assigning the return value of `word.translate` to `word`; that just doesn't have any effect on the *list* from which the value of `word` was taken. – chepner Sep 02 '19 at 21:51
  • I don't see how this is a duplicate. Aren't I doing that with `word = word.function`? – artemis Sep 02 '19 at 21:52
  • `word = ...` only changes what the name `word` refers to; it does not modify the list you are iterating over. (If it *did*, every `for` loop would alter its list, as the loop itself is constantly (re)assigning to the loop variable.) – chepner Sep 02 '19 at 21:53
  • `all_words = all_words.lower()` is commented out? Is that correct? – Cory Nezin Sep 02 '19 at 21:54
  • I don't understand the downvotes. I am sure this is easy to some, but not me, or else I would not have spent the time to make a question. – artemis Sep 02 '19 at 21:54
  • Sorry, @CoryNezin, I posted a test version. Let me fix. – artemis Sep 02 '19 at 21:54

2 Answers

3

Assigning to word does not change the element of the list that was previously assigned to word; it simply changes what the name word now refers to.
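
To make the distinction concrete, here is a minimal sketch (the sample list and the name unchanged are made up for illustration) that contrasts rebinding the loop variable with assigning through an index:

```python
words = ["This", ".", "Is"]

# Rebinding the loop variable: the name `word` points at a new string,
# but the list still holds the original objects.
for word in words:
    word = word.lower()
unchanged = list(words)
print(unchanged)  # ['This', '.', 'Is']

# Assigning through an index: the list element itself is replaced.
for i in range(len(words)):
    words[i] = words[i].lower()
print(words)  # ['this', '.', 'is']
```

Only the second loop changes what the list contains.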

You want to build a new list:

def normalize_text(file):
    # This could be defined once outside the function
    table = str.maketrans('','',string.punctuation)
    all_words = word_tokenize(file)
    return [word.lower().translate(table) for word in all_words]

A similar approach is to assign directly to a list element, which is different from assigning to word.

def normalize_text(file):
    all_words = word_tokenize(file)
    for i, word in enumerate(all_words):
        word = word.lower()
        # Assigning to all_words[i] replaces the element in the list itself
        all_words[i] = word.translate(str.maketrans('','',string.punctuation))

    return all_words
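
Note that either version turns a token that is pure punctuation, such as '.', into an empty string rather than removing it, so the counts would still include those empty tokens. A rough sketch of one way to get the expected output from the question follows. It rests on a few assumptions: str.split() stands in for word_tokenize so the example runs without NLTK, collections.Counter stands in for the countElement helper that the question does not show, and the names PUNCT_TABLE and normalize_tokens are invented for this example.

```python
import string
from collections import Counter

# Built once: a translation table that deletes every ASCII punctuation character.
PUNCT_TABLE = str.maketrans('', '', string.punctuation)

def normalize_tokens(tokens):
    """Lowercase each token, strip punctuation, and drop tokens left empty."""
    cleaned = (token.lower().translate(PUNCT_TABLE) for token in tokens)
    return [token for token in cleaned if token]

text = "This. this, Is, is. An; an, Example. example! Sentence? sentence."
# Assumption: str.split() replaces word_tokenize here to avoid the NLTK dependency.
words = normalize_tokens(text.split())
# Assumption: Counter replaces the countElement helper, which is not shown.
counts = Counter(words)

print(words)
print(dict(counts))
print("Collection Size: {}".format(sum(counts.values())))  # 10
print("Vocabulary Size: {}".format(len(counts)))            # 5
```

The same filter (keeping only non-empty tokens) should apply unchanged to the list returned by word_tokenize.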
chepner
  • `AttributeError: 'list' object has no attribute 'lower'` I thought `.lower()` had to be called on every individual item in a list? – artemis Sep 02 '19 at 21:58
  • I am using the second solution you have posted, by assigning directly to a list element, and this is working much better than what I had. I will look more into `enumerate`; I did not know it existed. – artemis Sep 02 '19 at 22:01
  • `enumerate` is totally overkill for this task – Matt L. Sep 02 '19 at 22:12
  • @MattL. It is. I added it mainly to show that assigning directly to `all_words[i]` is different from assigning to `word`. – chepner Sep 02 '19 at 22:21
0

The error is in the following lines of code:

for word in all_words:
    word = word.lower()
    word = word.translate(str.maketrans('','',string.punctuation))

The loop variable word is just a name that the loop rebinds to each element in turn. Assigning to it inside the loop does not replace the element in the list. See https://eli.thegreenplace.net/2015/the-scope-of-index-variables-in-pythons-for-loops/

Instead, there are two ways to loop and replace. Method 1 is to append to a new list like this:

all_words_new = []
for word in all_words:
    new_word = word.lower()
    newer_word = new_word.translate(str.maketrans('','',string.punctuation))
    all_words_new.append(newer_word)

Method 2 is a list comprehension and is a bit more advanced.

all_words_new = [word.lower() for word in all_words]
# Iterate over all_words_new (not all_words) so the lowercasing is kept
all_words_newer = [word.translate(str.maketrans('','',string.punctuation)) for word in all_words_new]

For more on list comprehensions, see https://www.pythonforbeginners.com/basics/list-comprehensions-in-python

Matt L.