sed: matching unicode blocks with

Question

I am desperately trying to replace certain Unicode characters (graphemes) in a file using sed. However, I keep failing for some of them, namely the ones from these Unicode blocks:

\p{InHigh_Surrogates}: U+D800–U+DB7F
\p{InHigh_Private_Use_Surrogates}: U+DB80–U+DBFF
\p{InLow_Surrogates}: U+DC00–U+DFFF

I tried (in a sed config file loaded via the -f switch):

s/\p{InHigh_Surrogates}/###/  --> no effect at all
s/\\p\{InHigh_Surrogates\}/###_D-NON-UTF8_###/ -> error message 'Invalid content of \{\}'

Anybody got a suggestion? I am not necessarily set on using the blocks, but I also failed when trying to define a character range of the form \xd800-\xdfff.

Thanks, Thomas

The reason might be that surrogates are invalid in UTF-8. – nwellnhof Mar 17 '14 at 15:08

score 2 · Answer 1 · answered Mar 17 '14 at 09:27

Try using the -r flag for sed:

$ sed -r 's/\\p\{InHigh_Surrogates\}/###/g' file
###: U+D800–U+DB7F
\p{InHigh_Private_Use_Surrogates}: U+DB80–U+DBFF
\p{InLow_Surrogates}: U+DC00–U+DFFF

From man sed:

-r, --regexp-extended

use extended regular expressions in the script.
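
Note that the pattern above matches the literal text `\p{InHigh_Surrogates}` in the file, not the characters of that block; GNU sed's regular expressions do not support `\p{...}` Unicode properties. If the goal is to match the surrogate code points themselves, one option is to work at the byte level. The following is only a sketch under some assumptions: GNU sed (for the `\xHH` escapes), and input in which surrogates appear as the three-byte sequences ED A0..BF 80..BF, which is what U+D800–U+DFFF look like when (invalidly) encoded as UTF-8. The function name `replace_surrogates` is made up for illustration.

```shell
# Hypothetical helper: replace UTF-8-style encoded surrogates
# (U+D800-U+DFFF, bytes ED A0..BF 80..BF) with the marker "###".
# Assumes GNU sed, which understands \xHH escapes in regular expressions.
# LC_ALL=C makes sed treat the input as raw bytes, so the otherwise
# invalid sequences can be matched byte by byte.
replace_surrogates() {
  LC_ALL=C sed 's/\xED[\xA0-\xBF][\x80-\xBF]/###/g' "$@"
}

# Usage: U+D800 written as the bytes ED A0 80 (octal 355 240 200),
# placed between "a" and "b".
printf 'a\355\240\200b\n' | replace_surrogates
```

Setting `LC_ALL=C` matters here: in a UTF-8 locale, sed may not match byte sequences that are not valid characters. The same script line should also work from a file loaded with `-f`, as long as sed is started with `LC_ALL=C`.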

fedorqui

Thanks! I tried that, and it required changing some other lines as well, but InHigh_Surrogates still seems to be the problem... – DrTH Mar 17 '14 at 12:34

But is it working for you or not? If not, please update your question with the exact problem you are facing. If it is, note that you can mark the answer as accepted. – fedorqui Mar 17 '14 at 12:51
Sorry for being imprecise: no, it did not work using `-r` either. It seems to me that sed does not know about Unicode blocks, or I am too dumb to make it work ;) I cannot give any clearer explanation than the one provided. Either way, I get the same error message described in my initial posting. – DrTH Mar 18 '14 at 16:18

I am sorry to say I don't know what else it could be. You can try checking this site for possible options. For example, [Remove unicode characters from textfiles - sed , other bash/shell methods](http://stackoverflow.com/q/8562354/1983854) – fedorqui Mar 18 '14 at 16:20
