
I wrote a function to count the 20 most common words in a book that I downloaded as plain text. The Python textbook I am working from says to import string and then use the replace or translate method to remove punctuation, but when I print the lines after the replace step, they all still contain punctuation. I tried reordering the line = line.strip() and line = line.replace(string.punctuation,'') steps, but that did not help. I have never used replace before, so I may be using it incorrectly. The rest of my program works; only this step is giving me trouble.

import string
def function():
    infile = open('gutbook.txt','r',encoding='utf-8')
    count = dict()
    list2 = list()
    for line in infile:
        line = line.strip()
        line = line.replace(string.punctuation,'')
        line = line.lower().split()
        if line== []:
            continue
        for i in line:
            count[i] = count.get(i,0) + 1
    for key,value in count.items():
        newtuple = (value,key)
        list2.append(newtuple)
    list3 = sorted(list2,reverse = True)
    print(list3[:20])



function()
John
Matt

1 Answer

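Use a regular expression. line.replace(string.punctuation,'') only removes the entire punctuation string when it appears in the line as one contiguous substring, so individual punctuation characters are left in place.

Example: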

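import re
import string

# Raw string so the backslash in the sample text is kept literally.
text = r"Hello ! #$%&'()*+,-./:;<=>?@[\]^_`{|}~ World"

# Remove every character listed in string.punctuation.
print(re.sub("[" + re.escape(string.punctuation) + "]", "", text))

# Or: remove everything that is not an ASCII letter, digit, or whitespace.
print(re.sub(r'[^a-zA-Z0-9\s]', '', text))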
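
The question also mentions the translate method. As a minimal sketch of that route (the names `REMOVE_PUNCTUATION`, `count_words`, and `sample` are made up for illustration and are not from the question), `str.maketrans` with a third argument builds a table that deletes each listed character, and `collections.Counter` can stand in for the manual dict and sort:

```python
import string
from collections import Counter

# Translation table that deletes every character in string.punctuation.
REMOVE_PUNCTUATION = str.maketrans("", "", string.punctuation)


def count_words(lines):
    """Count lowercase words in an iterable of lines, ignoring ASCII punctuation."""
    counts = Counter()
    for line in lines:
        cleaned = line.translate(REMOVE_PUNCTUATION).lower()
        counts.update(cleaned.split())  # split() also drops surrounding whitespace
    return counts


# Hypothetical sample data standing in for the lines of the book.
sample = ["Hello, world!", "Hello again; hello... world?"]
top = count_words(sample).most_common(2)
print(top)  # [('hello', 3), ('world', 2)]

# For comparison: replace() looks for the whole punctuation string at once,
# so it leaves this line unchanged.
unchanged = "Hello, world!".replace(string.punctuation, "")
print(unchanged)  # Hello, world!
```

For the book in the question, passing the open file object to `count_words` and calling `.most_common(20)` should give the top 20 words. Note that `string.punctuation` only covers ASCII punctuation, so curly quotes or long dashes in the text would still need separate handling.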
Rakesh
    You've got a subtle bug here, that can be fixed by wrapping `string.punctuation` in [`re.escape`](https://docs.python.org/3/library/re.html#re.escape), e.g. `re.sub("[" +re.escape(string.punctuation) + "]", "", text)`. Without escaping it, it won't treat `\ ` as punctuation (it interprets the `\ ` as escaping the `]` in `string.punctuation`, which prevents everything exploding, but also omits `\ ` from the set of characters to match). – ShadowRanger Jun 01 '18 at 21:25
  • @ShadowRanger. Thank you so much. – Rakesh Jun 01 '18 at 21:28
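
A small sketch of the difference ShadowRanger describes, using a made-up sample string (`text`, `unescaped`, and `escaped` are illustrative names only). Without `re.escape`, the backslash inside `string.punctuation` is read as escaping the following `]`, so the backslash itself is not part of the character set:

```python
import re
import string

# Hypothetical sample text containing a backslash and other punctuation.
text = r"back\slash [brackets] and-dash!"

# Unescaped: the backslash in string.punctuation escapes the "]" after it,
# so a literal backslash in the text is not matched.
unescaped = re.sub("[" + string.punctuation + "]", "", text)

# Escaped: every punctuation character, including the backslash, is matched.
escaped = re.sub("[" + re.escape(string.punctuation) + "]", "", text)

print(unescaped)  # back\slash brackets anddash
print(escaped)    # backslash brackets anddash
```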