2

I am removing control characters from a string as I load and deserialise it. I do this with the following regex, which is fine:

\\p{C}

The issue is that part of the text is meant to contain newlines. So what I need to do is remove all control characters unless they fall between <Text> and </Text>.

How can I do this with a regex?

Robin
  • Not easily; you should consider a more sophisticated solution; I happen to have [a project which could help you there](https://github.com/parboiled1/grappa) – fge May 15 '14 at 09:17
  • Or else, well, your input seems to be XML so why not use a streaming XML parser API? – fge May 15 '14 at 09:18

3 Answers

3

You could use

replaceAll("(?s)(<Text>.*?</Text>)|\\p{C}", "$1")

The idea is to match the contents of the Text tags and leave them alone (replace them with themselves). So if we encounter a \\p{C} on its own, we know it's not inside one.

Explanation:

  • (?s) activates "dot match all", so . will match newline as well
  • (<Text>.*?</Text>) captures the text node in the first group. We replace with the result of this capture through $1
  • If we match \\p{C}, this means we are not in a Text node. So we replace with $1, which is empty since (<Text>.*?</Text>) didn't match in the alternation.
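The same idea can be sketched with an explicit Matcher loop. This is only an illustration, assuming plain Java: the class name StripOutsideText and the method strip are made up for the example. It writes the empty replacement out by hand instead of relying on how $1 is expanded when the group did not take part in the match, which may help if your runtime inserts something other than an empty string there.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class StripOutsideText {
    // Same pattern as above: either a whole <Text>...</Text> node (group 1)
    // or a single character from Unicode category C.
    private static final Pattern PATTERN =
            Pattern.compile("(?s)(<Text>.*?</Text>)|\\p{C}");

    // Hypothetical helper: removes \p{C} characters everywhere except inside <Text> nodes.
    static String strip(String input) {
        Matcher m = PATTERN.matcher(input);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String textNode = m.group(1); // null when only \p{C} matched
            String replacement = (textNode == null)
                    ? ""                                  // drop the control character
                    : Matcher.quoteReplacement(textNode); // keep the Text node verbatim
            m.appendReplacement(out, replacement);
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        String input = "a\u0001b<Text>line1\nline2</Text>c\nd";
        System.out.println(strip(input));
    }
}
```

Matcher.quoteReplacement is there so that any $ or \ inside the Text node is copied literally rather than being read as a group reference.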

Ideone illustration: http://ideone.com/xKZgsn

Robin
  • As an optional minor tweak, `[^<]*<` is more efficient than `.*?<` – zx81 May 15 '14 at 09:32
  • @zx81: I was going to link to the [Match a pattern except in three situations s1, s2, s3](http://stackoverflow.com/questions/23589174/match-a-pattern-except-in-three-situations-s1-s2-s3) answer, then I realized you wrote it :) – Robin May 15 '14 at 09:32
  • I think this is removing both the control characters and the entire Text tags. Have I misunderstood how you are suggesting I use this? – David Kibblewhite May 15 '14 at 09:32
  • @zx81: It is, but I don't know whether there are other tags nested inside the `<Text>` one. – Robin May 15 '14 at 09:33
  • Ha, small world, and yeah, good point!... Scrap that thought. :) – zx81 May 15 '14 at 09:35
  • I think I understand what you're saying and based on that it seems like it should work, but when I actually use it on my string it removes both the control characters AND the text tags and everything inside. – David Kibblewhite May 15 '14 at 09:46
  • @DavidKibblewhite Did you replace with `$1`? [See example with digits](http://fiddle.re/ebkcp) (click on "Java" -> replaceAll section) works fine for me. – Jonny 5 May 15 '14 at 09:56
  • Ah I see what's happening now. Your regex replaces any control characters with "null", rather than "", which is what it really needs to be. – David Kibblewhite May 15 '14 at 10:03
  • Accepted your answer, since anything still wrong is a separate issue for me to look at. – David Kibblewhite May 15 '14 at 10:06
0

You could use this regex :

/(?!<text[^>]*?>)(\p{C}+)(?![^<]*?<\/text>)/gi

But, as mentioned by @fge, it would be better to parse your input cleanly.
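Since the question is about Java, here is a rough port of that pattern, offered as a sketch rather than a tested solution for real data. The /gi flags become replaceAll plus an inline (?i), and the class name LookaheadStrip and method strip are invented for the example. It assumes there are no other tags nested inside the Text node: a control character that sits before a nested tag would still be removed, because the lookahead stops at the first <.

```java
public class LookaheadStrip {
    // Java port of the pattern above: (?i) replaces the /i flag,
    // replaceAll replaces the /g flag.
    private static final String REGEX =
            "(?i)(?!<text[^>]*?>)(\\p{C}+)(?![^<]*?</text>)";

    // Hypothetical helper: drops control characters that are not
    // followed by "[^<]*</text>", i.e. that are not inside a Text node.
    static String strip(String input) {
        return input.replaceAll(REGEX, "");
    }

    public static void main(String[] args) {
        String input = "a\u0001b<Text>x\ny</Text>c\nd";
        System.out.println(strip(input));
    }
}
```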

zessx
0

Here is a string I use to test regex patterns that remove control characters.

AAU?Aasddsaustw3h,kdf134dfswdesdfent?�sdfsadfa45678r?w3h,kdf134dfswdesdfawh,kdf134dfswdesdfsurew3h,kdf134dfswdesdfent??3asdfliit/123423defwecty ?�STasd?Pawh,kdf134dfswdesdfks?Hw3rsdfsd134dfswdet

It seems the regex pattern "[[:cntrl:]]" works well. string.replaceAll("[\u0000-\u001f]", "") replaces only some of them. "\p{Cntrl}" replaces only the "empty" string after "wecty".

Can anyone tell me what those control characters are? I can replace them, but I could not figure out what they are. The online Java regex tester shows that 11 control characters are matched: https://www.freeformatter.com/java-regex-tester.html#ad-output
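One way to find out what they are is to print the code point and Unicode general category of every character that falls in category C. The snippet below is a minimal sketch, assuming Java 8 or later; the class name ControlCharInspector and its methods are made up for illustration, and you would run it on the original string as loaded, not on a copy pasted from a web page.

```java
import java.util.ArrayList;
import java.util.List;

public class ControlCharInspector {
    // Maps Character.getType values to the two-letter names of the
    // sub-categories that make up \p{C}; returns null for anything else.
    static String categoryName(int type) {
        switch (type) {
            case Character.CONTROL:     return "Cc";
            case Character.FORMAT:      return "Cf";
            case Character.SURROGATE:   return "Cs";
            case Character.PRIVATE_USE: return "Co";
            case Character.UNASSIGNED:  return "Cn";
            default:                    return null;
        }
    }

    // Hypothetical helper: lists every category-C code point in the string.
    static List<String> describe(String s) {
        List<String> found = new ArrayList<>();
        s.codePoints().forEach(cp -> {
            String category = categoryName(Character.getType(cp));
            if (category != null) {
                found.add(String.format("U+%04X %s", cp, category));
            }
        });
        return found;
    }

    public static void main(String[] args) {
        // BEL is a control character, U+200B (zero width space) is a format character.
        describe("a\u0007b\u200Bc").forEach(System.out::println);
    }
}
```

Characters reported as Cf, Co or Cn are outside the range U+0000 to U+001F, which could explain why a range-based pattern removes only some of them.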

Decula