I've got a String containing text, control characters, digits, umlauts (German) and other UTF-8 characters.

I want to strip all characters that are not "part of the language". Special characters like (incomplete list) ":/\ßä,;\n \t" should all be preserved.

Sadly, Stack Overflow removes all those characters, so I have to append a picture (link).

Any ideas? Help is much appreciated!

PS: If anybody knows a pasting service that does not kill those special characters, I would happily upload the strings. I just wasn't able to find one.

[Edit]: I THINK the regex "\P{Cc}" matches all the characters I want to PRESERVE. Could this regex be inverted so that it matches all characters it currently does not match?

friesoft
  • Not sure, but it is possible that inverted version of \P{something} can be \p{something}. If not you can try with [^\P{something}]. – Pshemo Mar 20 '13 at 10:31
  • @Pshemo Yeah, indeed, lowercase seems to work, as has been posted below. Thanks! – friesoft Mar 20 '13 at 10:32
  • Possible duplicate of [Fastest way to strip all non-printable characters from a Java String](http://stackoverflow.com/questions/7161534/fastest-way-to-strip-all-non-printable-characters-from-a-java-string) – Stewart Oct 14 '16 at 17:34

2 Answers

You have already found Unicode character properties.

You can invert a character property by changing the case of the leading "p".

For example:

\p{L} matches all letters.

\P{L} matches all characters that do not have the letter property.

So if you think \P{Cc} is what you need, then \p{Cc} would match the opposite.

More details on regular-expressions.info

I am quite sure \p{Cc} is close to what you want, but be careful: it also includes, for example, the tab (0x09), the line feed (0x0A) and the carriage return (0x0D).

But you can create your own character class, like this:

[^\P{Cc}\t\r\n]

The [^...] is a negated character class. Here it matches every character that is neither a "non-control character" (\P{Cc}) nor a tab, CR or LF. Because of the double negation, that means it matches all control characters except tab, CR and LF.
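
As a minimal sketch of how this class could be used from Java (the question links to a Java duplicate, so Java is assumed here), the snippet below removes every match of the class from a string. The class name `ControlCharStripper`, the method `strip` and the sample input are made up for illustration.

```java
import java.util.regex.Pattern;

public class ControlCharStripper {
    // Control characters (Unicode category Cc) except tab, CR and LF.
    // Backslashes are doubled because this is a Java string literal.
    private static final Pattern UNWANTED =
            Pattern.compile("[^\\P{Cc}\\t\\r\\n]");

    // Hypothetical helper: returns the input with all matches removed.
    static String strip(String input) {
        return UNWANTED.matcher(input).replaceAll("");
    }

    public static void main(String[] args) {
        // NUL (U+0000) and BEL (U+0007) are control characters;
        // the umlaut, the sharp s, the tab and the line feed are kept.
        String dirty = "Gr\u0000\u00fc\u00dfe:\t1/2\u0007\n";
        String clean = strip(dirty);
        System.out.println(clean.equals("Gr\u00fc\u00dfe:\t1/2\n"));
    }
}
```

Compiling the pattern once and reusing it avoids recompiling the regex on every call, which `String.replaceAll` would otherwise do.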

stema
  • Very nice! Thanks I didn't know that.. guess I really have to read up more regex tutorials... – friesoft Mar 20 '13 at 10:31
  • @friesoft The linebreak regular expression is `\r|\n|\r\n`, so `\p{Cc}|\r|\n|\r\n` should suit your needs. – sp00m Mar 20 '13 at 10:37

You can use:

your_string.replaceAll("\\p{C}", "");
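
Note that \p{C} is broader than \p{Cc}: besides control characters it also covers other "Other" categories such as format characters, and like \p{Cc} it includes tab, CR and LF. The sketch below compares the one-liner with a variant that uses Java's character class intersection (`&&`) to keep those three characters. The class and method names are hypothetical, and the variant is one possible adjustment, not part of the original suggestion.

```java
public class OtherCategoryStripper {
    // The one-liner from above: removes every character in category C.
    static String stripAllOther(String input) {
        return input.replaceAll("\\p{C}", "");
    }

    // Variant: category C intersected with "not tab, CR or LF",
    // so those three whitespace characters survive.
    static String stripOtherKeepWhitespace(String input) {
        return input.replaceAll("[\\p{C}&&[^\\t\\r\\n]]", "");
    }

    public static void main(String[] args) {
        // U+200B (zero width space) is a format character (category Cf).
        String sample = "a\tb\u200Bc\n";

        // Tab and line feed are removed along with U+200B.
        System.out.println(stripAllOther(sample).equals("abc"));

        // Only U+200B is removed; tab and line feed are kept.
        System.out.println(stripOtherKeepWhitespace(sample).equals("a\tbc\n"));
    }
}
```

Which of the two fits depends on whether format characters and similar invisible code points should go as well, or only control characters.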
Jayamohan