2

I have a .txt file and I want to iterate over its words. My problem is that I need to remove the punctuation marks before iterating over the words. I have tried this, but it isn't removing the punctuation marks.

file=open(file_name,"r")
for word in file.read().strip(",;.:- '").split():
     print word
file.close()
StephenTG
  • Do you want to remove the punctuation and then write back to the file? Also, that will strip off those characters from the beginning and end of the entire file only, not the individual words – Farhan.K Sep 27 '16 at 14:02
  • What does [**`strip`**](https://docs.python.org/3/library/stdtypes.html#str.strip) do? – Peter Wood Sep 27 '16 at 14:02
  • `split()` first, then `strip()` (at least that should get you closer to your goal) – Klaus D. Sep 27 '16 at 14:03
  • @Farhan.K I don't want to touch the original file. I only want to get the words separately, without punctuation marks –  Sep 27 '16 at 14:04
  • @KlausD. I cannot do this because strip can't be used with lists and split converts the string to a list –  Sep 27 '16 at 14:06
  • Then you have to iterate. – Klaus D. Sep 27 '16 at 14:07
  • If you're processing English text wouldn't you want to avoid removing the punctuation in words such as 'won't'? Or are you planning to fix those up in subsequent processing? – Bill Bell Sep 28 '16 at 14:42

5 Answers

1

The problem with your current method is that .strip() doesn't really do what you want. It removes only leading and trailing characters of the whole string (and you want to remove ones within the text). Also, the characters to strip are passed as a single string, and passing that argument replaces the default whitespace stripping rather than adding to it.
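A quick check of that behavior, as a sketch assuming Python 3 and two made-up sample strings:

```python
sample = "...Hello, world: this is-a test!..."

# strip() only trims characters from both ends of the whole string;
# the comma, colon and hyphen inside the text are left untouched.
stripped = sample.strip(",;.:- '!")
print(stripped)

# The argument is a plain string of characters, and passing it replaces
# the default whitespace stripping instead of extending it, so the
# surrounding spaces (and the dot behind them) survive here.
padded = "  word.  "
print(repr(padded.strip(".")))
```

The first call leaves every punctuation mark that sits between words, which is what the code in the question runs into.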

Another problem is that there are many more potential punctuation characters (question marks, exclamations, unicode ellipses, em dashes) that wouldn't get filtered out by your list. Instead, you can use string.punctuation to get a wide range of characters (note that string.punctuation doesn't include some non-English characters, so its viability may depend on the source of your input):

import string
punctuation = set(string.punctuation)
text = ''.join(char for char in text if char not in punctuation)

An even faster method (shown in other answers on SO) uses the Python 2 form of str.translate() to delete the characters:

import string
text = text.translate(string.maketrans('', ''), string.punctuation)
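The two-argument translate call is specific to Python 2. A rough Python 3 equivalent might look like the sketch below; remove_punctuation is a name chosen here for illustration and is not part of the original answer:

```python
import string


def remove_punctuation(text):
    """Return text with every character in string.punctuation deleted."""
    # In Python 3, the third argument of str.maketrans lists the
    # characters that translate() should delete.
    table = str.maketrans('', '', string.punctuation)
    return text.translate(table)


print(remove_punctuation("Hello, world! It's a test."))
```

Note that this also deletes apostrophes, so "It's" becomes "Its".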
ASGM
1

strip() only removes characters found at the beginning or end of a string. So split() first to cut the text into words, then strip() each word to remove the punctuation around it.

import string

with open(file_name, "rt") as finput:
    for line in finput:
        for word in line.split():
            print word.strip(string.punctuation)

Or use a natural language aware library like nltk: http://www.nltk.org/

Guillaume
0

You can try using the re module:

import re
with open(file_name) as f:
    for word in re.split(r'\W+', f.read()):
        print word

See the re documentation for more details.

Edit: The previous code drops non-ASCII characters, treating them as separators. In that case the following code can help:

import re
with open(file_name) as f:
    for word in re.compile(r'\W+', re.UNICODE).split(f.read().decode('utf8')):
        print word
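One caveat with splitting on \W+ is that re.split can return empty strings when the text starts or ends with a non-word character. A possible way around that, sketched here for Python 3 (where str is already Unicode, so no decode step or flag is needed), is to match the words rather than split on the gaps. The function name and sample text are illustrative only:

```python
import re


def iter_words(text):
    """Yield each run of word characters (letters, digits, underscore)."""
    for match in re.finditer(r"\w+", text):
        yield match.group()


sample = "Hello, world! Café au lait..."
print(list(iter_words(sample)))
```

Like the \W+ split, this treats an apostrophe as a separator, so "won't" would come out as two pieces.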
Frodon
0

The following code preserves apostrophes and blanks, and could easily be modified to preserve double quotation marks, if desired. It works by using a translation table based on a subclass of the string object. I think the code is fairly easy to understand. It might be made more efficient if necessary.

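class SpecialTable(str):
    # str.translate() looks up each character's code point via __getitem__.
    def __getitem__(self, code_point):
        # Keep space (32), apostrophe (39), digits and ASCII letters.
        if code_point==32 or code_point==39 or 48<=code_point<=57 \
            or 65<=code_point<=90 or 97<=code_point<=122:
            return code_point
        else:
            # Returning None deletes the character.
            return None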

specialTable = SpecialTable()


with open('temp2.txt') as inputText:
    for line in inputText:
        print (line)
        convertedLine=line.translate(specialTable)
        print (convertedLine)
        print (convertedLine.split(' '))

Here's typical output.

This! is _a_ single (i.e. 1) English sentence that won't cause any trouble, right?

This is a single ie 1 English sentence that won't cause any trouble right
['This', 'is', 'a', 'single', 'ie', '1', 'English', 'sentence', 'that', "won't", 'cause', 'any', 'trouble', 'right']
'nother one.

'nother one
["'nother", 'one']
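For comparison, a similar effect can be sketched with a plain translation table instead of a str subclass. This assumes Python 3 and deletes only the ASCII marks in string.punctuation, so unlike the class above it leaves any other character (a newline, for instance) in place. The variable names are illustrative:

```python
import string

# Delete every ASCII punctuation mark except the apostrophe.
to_delete = string.punctuation.replace("'", "")
keep_apostrophes = str.maketrans('', '', to_delete)

line = "This! is _a_ single (i.e. 1) English sentence that won't cause any trouble, right?"
converted = line.translate(keep_apostrophes)
print(converted.split())
```

Using split() with no argument also collapses runs of whitespace, which split(' ') does not.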
Bill Bell
-1

I would remove the punctuation marks with the replace function after storing the words in a list like so:

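punctuation = [',', ';', '.', ':', '-']
words = []
with open(file_name, "r") as f_r:
    for row in f_r:
        # extend (not append) keeps words a flat list of strings
        words.extend(row.split())
# remove one punctuation mark at a time from every word
for mark in punctuation:
    words = [word.replace(mark, '') for word in words]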
Ma0
  • Works, but one of the worst possible solutions in terms of memory efficiency. Also difficult to read. – Guillaume Sep 27 '16 at 14:19
  • Actually all this can be compressed into a single line. Are you seriously having problems reading this? – Ma0 Sep 27 '16 at 14:24