
I am trying to get the tokens (words) from a text file and strip them of all punctuation characters. I am trying the following:

import re 

with open('hw.txt') as f:
    lines_after_254 = f.readlines()[254:]
    sent = [word for line in lines_after_254 for word in line.lower().split()]
    words = re.sub('[!#?,.:";]', '', sent)

I am getting the following error:

return _compile(pattern, flags).sub(repl, string, count)
TypeError: expected string or buffer

4 Answers


re.sub is to be applied to a string, not a list!

print(re.sub(pattern, '', sent))

should be

print([re.sub(pattern, '', s) for s in sent])

Hope this helps!
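Applied to the original snippet, the per-element substitution looks like this (a sketch; the stand-in list replaces the lines read from hw.txt, and the pattern is the one from the question):

```python
import re

pattern = '[!#?,.:";]'
lines = ['Hello, world!', 'How are you?']  # stand-in for the lines read from hw.txt
sent = [word for line in lines for word in line.lower().split()]
# re.sub works on one string at a time, so map it over the list
words = [re.sub(pattern, '', word) for word in sent]
print(words)  # ['hello', 'world', 'how', 'are', 'you']
```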


A couple of things are off in your script: you split the text into words first and then try to remove the special characters from the resulting list, but re.sub only works on a single string.

A better way would be to read the input as one string, remove the special characters, and then tokenize it.

import re

# open the input text file and read it
with open('hw.txt') as f:
    text = f.read()
print(text)

# remove the special characters from the read string
no_specials_string = re.sub('[!#?,.:";]', '', text)
print(no_specials_string)

# split the text and store the words in a list
words = no_specials_string.split()
print(words)

Alternatively, if you want to split into tokens first and then remove special characters, you can do this:

import re

# open the input text file and read it
with open('hw.txt') as f:
    text = f.read()
print(text)

# split the text and store the words in a list
words = text.split()
print(words)

# remove special characters from each word in words
new_words = [re.sub('[!#?,.:";]', '', word) for word in words]
print(new_words)

Use the remove_puncts() function below

import string
translator = str.maketrans('', '', string.punctuation)
def remove_puncts(input_string):
    return input_string.translate(translator)

Example usage

input_string = """"YH&W^(*D)#IU*DEO)#brhtr<><}{|_}vrthyb,.,''fehsvhrr;[vrht":"]`~!@#$%svbrxs"""
remove_puncts(input_string)
'YHWDIUDEObrhtrvrthybfehsvhrrvrhtsvbrxs'
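If you only want to delete the question's specific characters rather than all of string.punctuation, the same translation-table approach can be narrowed down (a sketch; the character set is copied from the question's regex):

```python
# build a translation table that deletes only the characters from the question
table = str.maketrans('', '', '!#?,.:";')
print('why, hello: "world"!'.translate(table))  # why hello world
```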

EDIT

Speed Comparisons

It turns out that the translator approach is faster than substituting with regular expressions.

import re, string
from time import time

pattern = '[!#?,.:";]'
def regex_sub(input_string):
    return re.sub(pattern, '', input_string)

translator = str.maketrans('', '', string.punctuation)
def string_translator(input_string):
    return input_string.translate(translator)

input_string = """cwsx#?;.frvcdr"""
string_translator(input_string)
regex_sub(input_string)

passes = 1000000
t1 = time()
for i in range(passes):
    a = string_translator(input_string)

t2 = time()
for i in range(passes):
    a = regex_sub(input_string)

t3 = time()

string_translator_time = t2 - t1
regex_sub_time = t3 - t2

print(string_translator_time) # 1.341651439666748
print(regex_sub_time) # 3.44773268699646
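The same comparison can also be done with the standard timeit module, which handles the loop and the clock itself (a sketch; absolute numbers vary by machine, and the regex is precompiled here to keep the comparison fair):

```python
import re
import string
import timeit

pattern = re.compile('[!#?,.:";]')
table = str.maketrans('', '', string.punctuation)
s = 'cwsx#?;.frvcdr'

# timeit repeats each call `number` times and returns the total seconds
t_translate = timeit.timeit(lambda: s.translate(table), number=100000)
t_regex = timeit.timeit(lambda: pattern.sub('', s), number=100000)

print(t_translate)
print(t_regex)
```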

Nothing is being read into your list:

In [14]: with open('data', 'r') as f:
    ...:     l=f.readlines()[254:]
    ...:     

In [15]: l
Out[15]: []

Assuming you want a list of words, try this:

import re

with open('data', 'r') as f:
    lines = [line.strip() for line in f]

sent = [w for word in lines[:254] for w in re.split(r'\s+', word)]

find = '[!#?,.:";]'
replace = ''

words = [re.sub(find, replace, word) for word in sent]

As @Keerthana Prabhakaran pointed out, the re.sub call has been corrected.
