Regex to remove control characters except in a certain tag

Question

I am removing control characters from a string as I load and deserialise it. I do this with the following regex, which is fine:

\\p{C}

The issue is that part of the text is meant to contain newlines. So what I need to do is remove all control characters unless they fall between <Text> and </Text>.

How can I do this with a regex?

Not easily; you should consider a more sophisticated solution; I happen to have [a project which could help you there](https://github.com/parboiled1/grappa) — fge, May 15 '14 at 09:17
Or else, well, your input seems to be XML so why not use a streaming XML parser API? — fge, May 15 '14 at 09:18

Robin · Accepted Answer · 2014-05-15T10:15:14.703

3

You could use

replaceAll("(?s)(<Text>.*?</Text>)|\\p{C}", "$1")

The idea is to match the whole Text element first and leave it alone (replace it with itself). So if we then match a \\p{C}, we know it's not inside one.

Explanation:

(?s) activates "dot match all", so . will match newline as well
(<Text>.*?</Text>) captures the text node in the first group. We replace with the result of this capture through $1
If we match \\p{C}, this means we are not in a Text node. So we replace with $1, which is empty since (<Text>.*?</Text>) didn't match in the alternation.
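A minimal sketch of the explanation above, applied to a small made-up input. The class name `TextTagAwareCleaner` and the sample string are assumptions for illustration only. It relies on an unmatched group being replaced by nothing, which is how OpenJDK-based runtimes behave as far as I know.

```java
class TextTagAwareCleaner {
    // Either capture a whole <Text>...</Text> block in group 1,
    // or match a single \p{C} character outside of such a block.
    static String clean(String input) {
        return input.replaceAll("(?s)(<Text>.*?</Text>)|\\p{C}", "$1");
    }

    public static void main(String[] args) {
        // Hypothetical input: control characters outside the tag, a newline inside it.
        String input = "a\u0001b\n<Text>line1\nline2</Text>\u0007c";
        String cleaned = clean(input);
        // The newline between line1 and line2 should survive; the others should not.
        System.out.println(cleaned.equals("ab<Text>line1\nline2</Text>c"));
    }
}
```

Because the `<Text>` alternative comes first, the regex engine consumes the whole element before it ever gets a chance to test `\\p{C}` against the characters inside it.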

Ideone illustration: http://ideone.com/xKZgsn

edited May 15 '14 at 10:15

answered May 15 '14 at 09:21

Robin


As an optional minor tweak, `[^<]*<` is more efficient than `.*?<` – zx81 May 15 '14 at 09:32
@zx81: I was going to link to the [Match a pattern except in three situations s1, s2, s3](http://stackoverflow.com/questions/23589174/match-a-pattern-except-in-three-situations-s1-s2-s3) answer, then I realized you wrote it :) – Robin May 15 '14 at 09:32
I think this is removing both the control characters and the entire Text tags. Have I misunderstood how you are suggesting I use this? – David Kibblewhite May 15 '14 at 09:32
@zx81: It is, but I don't know whether there are other tags nested inside the `<Text>` one. – Robin May 15 '14 at 09:33
Ha, small world, and yeah, good point!... Scrap that thought. :) – zx81 May 15 '14 at 09:35
I think I understand what you're saying and based on that it seems like it should work, but when I actually use it on my string it removes both the control characters AND the text tags and everything inside. – David Kibblewhite May 15 '14 at 09:46

@DavidKibblewhite Did you replace with `$1`? [See example with digits](http://fiddle.re/ebkcp) (click on "Java" -> replaceAll section) works fine for me. – Jonny 5 May 15 '14 at 09:56
Ah I see what's happening now. Your regex replaces any control characters with "null", rather than "", which is what it really needs to be. – David Kibblewhite May 15 '14 at 10:03
Accepted your answer, since anything still wrong is a separate issue for me to look at. – David Kibblewhite May 15 '14 at 10:06
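
The "null" reported in the comments suggests that the regex implementation in use substitutes the literal text "null" for a group that did not take part in the match. Assuming that is the cause, one possible workaround is to build the replacement explicitly instead of relying on `$1`. This is only a sketch; the class name `ExplicitGroupCleaner` is hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ExplicitGroupCleaner {
    private static final Pattern TEXT_OR_CONTROL =
            Pattern.compile("(?s)(<Text>.*?</Text>)|\\p{C}");

    static String clean(String input) {
        Matcher m = TEXT_OR_CONTROL.matcher(input);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            // group(1) is null when the \p{C} branch matched, so emit "" in that case.
            String kept = m.group(1);
            String replacement = (kept == null) ? "" : kept;
            // quoteReplacement keeps any '$' or '\' in the Text content literal.
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        String input = "a\u0001b\n<Text>cost: $5\nline2</Text>\u0007c";
        System.out.println(clean(input).equals("ab<Text>cost: $5\nline2</Text>c"));
    }
}
```

The explicit null check makes the result independent of how a given runtime renders an unmatched group in a replacement string.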

score 0 · Answer 2 · answered May 15 '14 at 09:30


You could use this regex :

/(?!<text[^>]*?>)(\p{C}+)(?![^<]*?<\/text>)/gi

But, as mentioned by @fge, it would be better to cleanly parse your input.
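
The regex above is written with JavaScript-style delimiters and flags. A rough Java translation, offered as a sketch rather than a tested drop-in, maps `/g` to `replaceAll` and `/i` to `Pattern.CASE_INSENSITIVE`. The class name `LookaheadCleaner` is hypothetical.

```java
import java.util.regex.Pattern;

class LookaheadCleaner {
    // Same pattern as above, with the backslash doubled for a Java string literal.
    private static final Pattern CONTROL_OUTSIDE_TEXT = Pattern.compile(
            "(?!<text[^>]*?>)(\\p{C}+)(?![^<]*?</text>)",
            Pattern.CASE_INSENSITIVE);

    static String clean(String input) {
        // A run of control characters is removed unless it is followed by
        // tag-free text and then a closing </text> tag.
        return CONTROL_OUTSIDE_TEXT.matcher(input).replaceAll("");
    }

    public static void main(String[] args) {
        String input = "a\u0001b\n<Text>line1\nline2</Text>\u0007c";
        System.out.println(clean(input).equals("ab<Text>line1\nline2</Text>c"));
    }
}
```

One caveat with this approach: the lookahead `[^<]*?` cannot cross a `<`, so a control character inside the Text element that is followed by a nested tag is not recognised as being inside it and is removed.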

answered May 15 '14 at 09:30

zessx


Decula · Answer 3 · 2018-08-28T19:33:58.430

Here is a string I have to test regex patterns that remove control characters.

AAU?Aasddsaustw3h,kdf134dfswdesdfent?�sdfsadfa45678r?w3h,kdf134dfswdesdfawh,kdf134dfswdesdfsurew3h,kdf134dfswdesdfent??3asdfliit/123423defwecty ?�STasd?Pawh,kdf134dfswdesdfks?Hw3rsdfsd134dfswdet

It seems the regex pattern "[[:cntrl:]]" works well. string.replaceAll("[\u0000-\u001f]", "") replaces only some of them. "\p{Cntrl}" only replaces the empty string after "wecty".

Can anyone tell me what those control characters are? I can replace them but could not figure out what they are. The Java online regex tester shows that 11 control characters are matched. https://www.freeformatter.com/java-regex-tester.html#ad-output
