I am reading a thousand-line Italian text and building a dictionary of unique words. I have tried two methods of removing the punctuation. The first uses string.punctuation:

for p in string.punctuation:
     word = word.replace(p, str())

or:

for line in f:
    for word in line.split():
        stripped_text =""
        for char in word:
            if char in '!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~>><<<<?>>?123456789':
               char = ''
               stripped_text += char

My problem is that this still contains punctuation:

{'<<Dicerolti': 1,'piage>>.': 1,'succia?>>.': 1,…}

Any ideas, please?

Martijn Pieters
user1478335
  • Sorry the returned dictionary did not come out correctly: {'<>.': 1, 'Nacqui': 1, 'angelo': 1, 'condotta.': 1, 'i': 258, 'voi': 91, 'digiunto.': 1, 'quei:': 1, 'porta.': 2, 'porta,': 5, 'via.': 2, 'consorto': 1, 'via,': 14, 'fosca,': 1, 'vince': 10, 'Lancialotto': 1, 'fosca!': 1, 'vinci': 2, 'voi?>>;': 1, – user1478335 Nov 07 '13 at 15:03
  • You can edit your question to update information. – Martijn Pieters Nov 07 '13 at 15:11
  • http://stackoverflow.com/questions/265960/best-way-to-strip-punctuation-from-a-string-in-python – jbat100 Nov 07 '13 at 15:36
  • Thank you for this. I have looked at the solutions in your reference, but I am somewhat lost. I am wondering whether the punctuation that is not removed is peculiar to Italian, particularly << and >>, which are used where English uses opening and closing double quotes. I tried word.translate(None, string.punctuation), but I get a TypeError saying it takes one argument and two were given. Also, in the dictionary above, porta appears four times: as porta; porta: porta. and porta,. So my argument rather falls away. I need more help if possible, please. – user1478335 Nov 07 '13 at 16:48

1 Answer

You could use the re module for this, with a little printf-style trick to build a regex character class that matches any punctuation character so it can be replaced with an empty string.

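import re
import string

a = '>>some_crazy_string..!'
# re.escape keeps ']', '\' and '^' literal inside the character class.
# print() with a single argument works on both Python 2 and Python 3.
print(re.sub('[%s]' % re.escape(string.punctuation), '', a))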

prints out

somecrazystring

I've used this trick a couple of times for 'anonymizing' log files.
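
Applied to the word-count problem in the question, the same idea might look like the sketch below. It assumes Python 3. The names `PUNCT_RE`, `TABLE`, `count_words` and the sample lines are illustrative, not from the original post. The guillemets « and » are added by hand on the assumption that the real text may contain them, since they are not in string.punctuation (the ASCII << and >> shown in the question already are). The `str.translate` line is an alternative for Python 3, where `translate` takes a single table argument, which would explain the TypeError mentioned in the comments.

```python
import re
import string
from collections import Counter

# Assumption: the text may also contain real guillemets, which are not
# part of string.punctuation, so they are appended explicitly.
PUNCT = string.punctuation + '«»'

# Compile the character class once; re.escape keeps characters such as
# ']', '\\', '^' and '-' literal inside [...].
PUNCT_RE = re.compile('[%s]' % re.escape(PUNCT))

# Alternative for Python 3: a translation table that deletes the same characters.
TABLE = str.maketrans('', '', PUNCT)


def count_words(lines):
    """Count words after stripping punctuation (hypothetical helper)."""
    counts = Counter()
    for line in lines:
        for word in line.split():
            cleaned = PUNCT_RE.sub('', word)
            if cleaned:  # skip tokens that consisted only of punctuation
                counts[cleaned] += 1
    return counts


# Illustrative sample lines, loosely based on the keys shown in the question.
sample = ['<<Dicerolti molto breve>>.', 'porta, porta. porta: porta;']
word_counts = count_words(sample)
print(word_counts['porta'])        # 4
print('voi?>>;'.translate(TABLE))  # voi
```

The important detail is that the cleaned word, not the original one, is used as the dictionary key. With a file, `count_words(f)` can be called on the open file object directly, since iterating over it yields lines.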

synthesizerpatel