14

Sometimes I have strings with strange characters. They are not visible in the browser, but they are part of the string and are counted by len(). How can I get rid of them? strip() removes ordinary whitespace but not these characters.

robos85
  • See this solution: http://stackoverflow.com/questions/92438/stripping-non-printable-characters-from-a-string-in-python – JJ. Aug 22 '11 at 12:28

4 Answers

17

Use the character categories from the string module. If you want to allow all printable characters, you can do

from string import printable
new_string = ''.join(char for char in the_string if char in printable)

Building on YOU's answer, you can do this with re.sub too:

import re
new_string = re.sub("[^{}]+".format(re.escape(printable)), "", the_string)

Also, if you want to see all the characters in a string, even the unprintable ones, you can always do

print(repr(the_string))

which will show things like \x00 for unprintable characters.
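As a rough illustration of the filter above: the sample string and the helper name keep_ascii_printable are made up for this sketch. Note that string.printable only covers ASCII, so this approach also drops accented letters and any other non-ASCII text.

```python
from string import printable


def keep_ascii_printable(text):
    """Keep only characters listed in string.printable (ASCII only)."""
    allowed = set(printable)  # set membership is faster than scanning the string
    return "".join(ch for ch in text if ch in allowed)


# Hypothetical input: e-acute, a zero-width space, and a NUL byte.
sample = "caf\u00e9\u200b \x00ok"

print(repr(sample))                        # repr() makes the hidden characters visible
print(repr(keep_ascii_printable(sample)))  # the e-acute is removed as well
```

If non-ASCII text has to survive, this filter is too aggressive and a Unicode-aware test is the better fit.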

agf
15

You can filter your string using str.isprintable() (Python 3, introduced by PEP 3138):

output_str = ''.join(c for c in input_str if c.isprintable())
  • Very easy to implement, even some years later. If you want to collect the invisible characters in the string instead, just use `[c for c in _x if not c.isprintable()]`. In my case that returns only the invisible ones, so you can inspect them in code and do whatever you want with them. – Bitart Dec 09 '21 at 17:56
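A small sketch of the isprintable() filter in use, with a made-up sample string and a hypothetical helper name. One thing to keep in mind: isprintable() is also False for newlines and tabs, so those are removed along with the invisible characters.

```python
def strip_nonprintable(text):
    """Drop every character for which str.isprintable() is False."""
    return "".join(ch for ch in text if ch.isprintable())


# Hypothetical input: zero-width space, NUL byte, trailing newline.
sample = "na\u00efve\u200b text\x00\n"

cleaned = strip_nonprintable(sample)
hidden = [ch for ch in sample if not ch.isprintable()]

print(repr(cleaned))  # non-ASCII letters such as the i-diaeresis are kept
print(hidden)         # the characters that were filtered out
```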
6

Collect the set of characters you want to allow and remove the rest, like this:

import re
text = re.sub("[^a-z0-9]+","", text, flags=re.IGNORECASE)

It removes any characters other than a to z, A to Z, and 0 to 9.
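To make the whitelist idea concrete, here is a minimal sketch with two precompiled patterns. The pattern names and the sample string are invented for the example. Extending the character class is how you allow more characters.

```python
import re

# Everything outside the bracketed set is removed.
ALNUM_ONLY = re.compile(r"[^a-z0-9]+", flags=re.IGNORECASE)
# Same whitelist, but plain spaces are allowed too.
ALNUM_AND_SPACE = re.compile(r"[^a-z0-9 ]+", flags=re.IGNORECASE)

# Hypothetical input containing a zero-width space after the comma.
sample = "Hello,\u200b World 42!"

print(ALNUM_ONLY.sub("", sample))
print(ALNUM_AND_SPACE.sub("", sample))
```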

YOU
  • I need the full UTF-8 character set :/ – robos85 Aug 22 '11 at 12:35
  • @robos85, you need some rule for what to strip and what to keep. Can I assume you need to strip all characters that are invalid in UTF-8? There is a solution for that, but the result might still include invisible/non-printable characters. – YOU Aug 22 '11 at 12:45
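For the case raised in these comments, where the full Unicode range has to survive, one possible approach (not proposed by anyone in this thread) is to filter by Unicode general category with the standard unicodedata module. A minimal sketch, assuming the characters to drop are those whose category starts with "C" (control, format, surrogate, private use, unassigned). The function name and the keep parameter are hypothetical.

```python
import unicodedata


def strip_other_category(text, keep="\n\t"):
    """Remove characters in Unicode 'Other' categories (Cc, Cf, Cs, Co, Cn).

    Characters listed in `keep` are preserved even though newline and tab
    are technically control characters.
    """
    return "".join(
        ch for ch in text
        if ch in keep or not unicodedata.category(ch).startswith("C")
    )


# Hypothetical input: Polish text with a zero-width space and a NUL byte.
sample = "Zażółć\u200b gęślą\x00 jaźń\n"

print(repr(strip_other_category(sample)))
```

Unlike a str.isprintable() filter, this keeps separators such as the no-break space, which may or may not be what you want.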
1

Regular expressions are a good and very universal tool for all kinds of string analysis. If speed is an issue, the "translate" method of the string class can help you too. The code below is written for Python 2 byte strings.

First, define an identity mapping, which does not change anything:

mapping = map(chr, range(256))

If you want to replace each "a" with a "b", modify the mapping:

mapping[ord('a')] = 'b'

Now you build the table for the "translate" method:

table = "".join(mapping)

and

print "abc".translate(table)

prints "bbc".

If you want to delete the "a" instead, leave the identity mapping unmodified, build the table, and pass the characters to delete as the second argument to translate:

print "abc".translate(table, "a")

gives you "bc".

Once the table is built, the translate method is very fast.

So in your case you can modify the mapping so that all your unwanted characters are mapped to a space:

table = "".join(" " if c in unwanted_chars else c for c in map(chr, range(256)))

and use len("my string".translate(table).strip()), which ignores the unwanted characters at the beginning and the end of the string.

Or use len("my string".translate(table, unwanted_chars)), which ignores all your unwanted characters.
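On Python 3, str.translate takes a single table argument, and str.maketrans builds that table. A rough equivalent of the steps above might look like this. The contents of unwanted_chars and the sample string are chosen arbitrarily for the example.

```python
# Hypothetical set of characters to get rid of: NUL, zero-width space, BOM.
unwanted_chars = "\x00\u200b\ufeff"

# The third argument of str.maketrans lists characters to delete.
delete_table = str.maketrans("", "", unwanted_chars)

# Dict form: map each unwanted code point to a space instead of deleting it.
space_table = {ord(ch): " " for ch in unwanted_chars}

sample = "\ufeffmy\u200b string\x00"

deleted = sample.translate(delete_table)
spaced = sample.translate(space_table).strip()  # strip() trims the edge spaces

print(repr(deleted))
print(repr(spaced))

# Replacing each "a" with a "b", as in the Python 2 walkthrough.
print("abc".translate(str.maketrans("a", "b")))
```

Deleting is usually the simpler choice when the goal is an accurate len(). Mapping to a space leaves interior replacements in place, and only the ones at the edges are trimmed by strip().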

rocksportrocker
  • Nice. +1 tomorrow when I have votes again. I thought about translate but was too lazy to look up the syntax. – agf Aug 22 '11 at 15:02