
I need to delete the following punctuation characters and entities in a text document.

  1. Numeric entities: &#151, &#148, and any other &# followed by a number
  2. Punctuation: ; , . ( ) [ ] * !
  3. &nbsp

I know that I can use the code below to delete the numeric entities and &nbsp. However, as a beginner, I don't know whether the same approach works for the other characters, such as ; and ,.

import re
match = re.sub(r'&#146', '', open('test2.txt', 'r').read())

Also, is there any way to delete all of them at once, rather than running the same code once per pattern?

Kev
Jimmy
  • related: [Best way to strip punctuation from a string in Python](http://stackoverflow.com/questions/265960/best-way-to-strip-punctuation-from-a-string-in-python) – jfs Aug 28 '12 at 04:20

2 Answers


Those look like HTML entities: numeric character references such as &#151 and named entities such as &nbsp.

Rather than deleting them, you could decode them into the characters they stand for.
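As a minimal sketch of the decoding approach, assuming Python 3.4 or later, `html.unescape` from the standard library converts entities into the characters they represent. The sample string here is made up for illustration.

```python
import html

# Hypothetical sample text containing one numeric and one named entity.
raw = "wait&#151;what&nbsp;now"

# html.unescape replaces each entity with the character it stands for.
decoded = html.unescape(raw)

# &#151; comes back as an em dash (U+2014), &nbsp; as a non-breaking space (U+00A0).
print(repr(decoded))
```

Once the text is decoded, the resulting characters can be stripped or replaced like any other character in the string.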

Jonathan Vanasco
  • Thanks. However, is there any way to delete , * ! . at once? – Jimmy Aug 28 '12 at 04:06
  • How about `re.sub(r"[][*!.();,]", "", your_string)`? Or, to take a slightly different approach, try matching everything but the characters you do want (such as letters, numbers and spaces): `re.sub(r"[^A-Za-z0-9 ]", "", your_string)`. – Blckknght Aug 28 '12 at 04:20
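To remove the entities and the punctuation in a single pass, the character class suggested in the comment can be combined with entity patterns through alternation. This is only a sketch: the pattern name `UNWANTED` and the sample string are assumptions made for illustration, and the entity patterns treat the trailing semicolon as optional because the question writes the entities without one.

```python
import re

# Hypothetical combined pattern:
#   &#\d+;?     numeric entities such as &#151 or &#148;
#   &nbsp;?     the named non-breaking-space entity
#   [][;,.()*!] the punctuation characters listed in the question
UNWANTED = re.compile(r"&#\d+;?|&nbsp;?|[][;,.()*!]")

# Made-up sample input.
text = "Hello&#151world&nbsp;(test)[1]; a, b. *c*!"

# One call removes every match of any of the alternatives.
cleaned = UNWANTED.sub("", text)
print(cleaned)
```

Compiling the pattern once also makes it cheap to reuse on several strings or files.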

If you already have everything in a string, you can simply use translate() (the two-argument form shown here is Python 2's str.translate):

>>> s
"hello there ! this is a string with $ some % characters I & don't ( want!"
>>> s.translate(None,"$!%&(")
"hello there  this is a string with  some  characters I  don't  want"
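On Python 3, `str.translate` accepts only a translation table, so the two-argument call above does not work there. A rough equivalent, assuming Python 3, builds the table with `str.maketrans`, whose third argument lists the characters to delete:

```python
s = "hello there ! this is a string with $ some % characters I & don't ( want!"

# The third argument to str.maketrans names the characters to drop.
table = str.maketrans("", "", "$!%&(")

cleaned = s.translate(table)
print(cleaned)
```

Only the listed characters are removed, so the spaces that surrounded them stay in place.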
Burhan Khalid