I am trying to replace certain Unicode characters (graphemes) in a file using sed. However, I keep failing for some of them, namely the ones from these Unicode blocks:

\p{InHigh_Surrogates}: U+D800–U+DB7F
\p{InHigh_Private_Use_Surrogates}: U+DB80–U+DBFF
\p{InLow_Surrogates}: U+DC00–U+DFFF

I tried (in a sed config file loaded via the -f switch):

s/\p{InHigh_Surrogates}/###/  --> no effect at all
s/\\p\{InHigh_Surrogates\}/###_D-NON-UTF8_###/  --> error message 'Invalid content of \{\}'

Does anybody have a suggestion? I am not set on using the blocks; I also failed when trying to define a character range of the form \xd800-\xdfff.

Thanks, Thomas

DrTH

1 Answer

Try using the -r flag for sed:

$ sed -r 's/\\p\{InHigh_Surrogates\}/###/g' file
###: U+D800–U+DB7F
\p{InHigh_Private_Use_Surrogates}: U+DB80–U+DBFF
\p{InLow_Surrogates}: U+DC00–U+DFFF

From man sed:

-r, --regexp-extended

use extended regular expressions in the script.
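
Note that `\p{...}` property classes are a Perl/PCRE-style feature; the basic and extended regular expressions that sed uses do not provide them, so the command above only matches the literal text `\p{InHigh_Surrogates}`. If the goal is to match the surrogate code points themselves, one possible workaround is to match their bytes. The following is a minimal sketch under stated assumptions, not a definitive solution: it assumes GNU sed (for `\xHH` escapes) and assumes the surrogates appear in the file as the three-byte sequences ED A0 80 through ED BF BF (as in CESU-8 or otherwise malformed UTF-8, since well-formed UTF-8 never contains surrogates). The function name `replace_surrogates` is made up for illustration.

```shell
# Sketch: replace encoded surrogate code points (U+D800..U+DFFF) byte-wise.
# Assumption: GNU sed, and surrogates stored as bytes ED A0 80 .. ED BF BF.
replace_surrogates() {
    # LC_ALL=C makes sed treat the input as raw bytes rather than
    # multibyte characters, so each \xHH matches exactly one byte.
    LC_ALL=C sed 's/\xED[\xA0-\xBF][\x80-\xBF]/###/g'
}

# Usage: printf emits the bytes ED A0 80 (U+D800) between "a" and "b".
printf 'a\355\240\200b\n' | replace_surrogates
```

To run it on a file, something like `replace_surrogates < file > file.clean` should work. The second byte range A0..BF is what restricts the match to U+D800..U+DFFF; other three-byte characters starting with ED (U+D000..U+D7FF) have a second byte in 80..9F and are left alone.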

fedorqui
  • Thanks! I tried that, which required changing some other lines as well, but InHigh_Surrogates still seems to be the problem... – DrTH Mar 17 '14 at 12:34
  • But is it working for you or not? If not, please update your question with the exact problem you are facing. If it is, note that you can mark the answer as accepted. – fedorqui Mar 17 '14 at 12:51
  • Sorry for being imprecise: no, it did not work using `-r` either. It seems to me that sed does not know about Unicode blocks, or I am too dumb to make it work ;) I cannot give any clearer explanation than the one provided. Either way, I get the same error message described in my initial posting. – DrTH Mar 18 '14 at 16:18
  • I am sorry to say I don't know what else it could be. You can try searching this site for possible options. For example, [Remove unicode characters from textfiles - sed , other bash/shell methods](http://stackoverflow.com/q/8562354/1983854) – fedorqui Mar 18 '14 at 16:20