
I am trying to get the tokens (words) from a text file and strip them of all punctuation characters. I am trying the following:

import re 

with open('hw.txt') as f:
    lines_after_254 = f.readlines()[254:]
    sent = [word for line in lines_after_254 for word in line.lower().split()]
    words = re.sub('[!#?,.:";]', '', sent)

I am getting the following error:

return _compile(pattern, flags).sub(repl, string, count)
TypeError: expected string or buffer

4 Answers


re.sub is to be applied to a string, not a list!

print(re.sub(pattern, '', sent))

should be

print([re.sub(pattern, '', s) for s in sent])

Hope this helps!
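Applied to the original snippet, the per-element substitution looks like this (a sketch; the stand-in list replaces the lines read from hw.txt, and the pattern is the one from the question):

```python
import re

pattern = '[!#?,.:";]'
lines = ['Hello, world!', 'How are you?']  # stand-in for the lines read from hw.txt
sent = [word for line in lines for word in line.lower().split()]
# re.sub works on one string at a time, so map it over the list
words = [re.sub(pattern, '', word) for word in sent]
print(words)  # ['hello', 'world', 'how', 'are', 'you']
```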


A couple of things are off in your script: you split the text into words first and then try to remove the special characters from the resulting list, but re.sub only works on a single string.

A better way would be to read the input as one string, remove the special characters, and then tokenize it.

import re

# open the input text file and read it
with open('hw.txt') as f:
    text = f.read()
print(text)

# remove the special characters from the read string
no_specials_string = re.sub('[!#?,.:";]', '', text)
print(no_specials_string)

# split the text and store the words in a list
words = no_specials_string.split()
print(words)

Alternatively, if you want to split into tokens first and then remove special characters, you can do this:

import re

# open the input text file and read it
with open('hw.txt') as f:
    text = f.read()
print(text)

# split the text and store the words in a list
words = text.split()
print(words)

# remove special characters from each word in words
new_words = [re.sub('[!#?,.:";]', '', word) for word in words]
print(new_words)

Use the remove_puncts() function below

import string
translator = str.maketrans('', '', string.punctuation)
def remove_puncts(input_string):
    return input_string.translate(translator)

Example usage

input_string = """"YH&W^(*D)#IU*DEO)#brhtr<><}{|_}vrthyb,.,''fehsvhrr;[vrht":"]`~!@#$%svbrxs"""
remove_puncts(input_string)
'YHWDIUDEObrhtrvrthybfehsvhrrvrhtsvbrxs'
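If you only want to delete the question's specific characters rather than all of string.punctuation, the same translation-table approach can be narrowed down (a sketch; the character set is copied from the question's regex):

```python
# build a translation table that deletes only the characters from the question
table = str.maketrans('', '', '!#?,.:";')
print('why, hello: "world"!'.translate(table))  # why hello world
```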

EDIT

Speed Comparisons

It turns out that the translator approach is faster than substituting with regular expressions.

import re, string
from time import time

pattern = '[!#?,.:";]'
def regex_sub(input_string):
    return re.sub(pattern, '', input_string)

translator = str.maketrans('', '', string.punctuation)
def string_translator(input_string):
    return input_string.translate(translator)

input_string = """cwsx#?;.frvcdr"""
string_translator(input_string)
regex_sub(input_string)

passes = 1000000
t1 = time()
for i in range(passes):
    a = string_translator(input_string)

t2 = time()
for i in range(passes):
    a = regex_sub(input_string)

t3 = time()

string_translator_time = t2 - t1
regex_sub_time = t3 - t2

print(string_translator_time) # 1.341651439666748
print(regex_sub_time) # 3.44773268699646
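The same comparison can also be done with the standard timeit module, which handles the loop and the clock itself (a sketch; absolute numbers vary by machine, and the regex is precompiled here to keep the comparison fair):

```python
import re
import string
import timeit

pattern = re.compile('[!#?,.:";]')
table = str.maketrans('', '', string.punctuation)
s = 'cwsx#?;.frvcdr'

# timeit repeats each call `number` times and returns the total seconds
t_translate = timeit.timeit(lambda: s.translate(table), number=100000)
t_regex = timeit.timeit(lambda: pattern.sub('', s), number=100000)

print(t_translate)
print(t_regex)
```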

Nothing is being read into your list:

In [14]: with open('data', 'r') as f:
    ...:     l=f.readlines()[254:]
    ...:     

In [15]: l
Out[15]: []

Assuming you want a list of words, try this:

import re

with open('data', 'r') as f:
    lines = [line.strip() for line in f]

sent = [w for word in lines[:254] for w in re.split(r'\s+', word)]

find = '[!#?,.:";]'
replace = ''

words = [re.sub(find, replace, word) for word in sent]

As @Keerthana Prabhakaran pointed out, the re.sub call has been corrected.
