4

Any idea how to get rid of the irritating character U+0092 from a bunch of text files? I've tried all of the commands below, but none of them work. The character map lists it as "U+0092+control".

sed -i 's/\xc2\x92//' *
sed -i 's/\u0092//' *
sed -i 's///' *

Ah, I've found a way:

CHARS=$(python2 -c 'print u"\u0092".encode("utf8")')
sed 's/['"$CHARS"']//g'

But is there a direct sed method for this?
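For comparison, the same workaround can be sketched in Python 3 without going through sed at all. This is only an illustration, assuming the files are UTF-8 encoded; `strip_u0092` and `TARGET` are names made up for this example.

```python
# Sketch: remove every UTF-8 encoded U+0092 from a byte string.
# Assumes the input is UTF-8; strip_u0092 is a hypothetical helper name.
TARGET = "\u0092".encode("utf-8")  # the two bytes 0xC2 0x92


def strip_u0092(data: bytes) -> bytes:
    """Return data with each occurrence of the bytes 0xC2 0x92 removed."""
    return data.replace(TARGET, b"")


if __name__ == "__main__":
    sample = "it\u0092s".encode("utf-8")
    print(strip_u0092(sample))
```

Working on bytes rather than decoded text means the file is never re-encoded, so everything except the two target bytes stays untouched.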

MightyPork
alvas
  • Single quotes will stop your shell from parsing any escaped notation for backtick. I'm not sure sed would do this itself, so maybe try double quotes? – Ulrich Schwarz Dec 20 '11 at 06:43
  • This one is tricky. It's some sort of non-printing character: U+0092 appears in the text files but is not visible. – alvas Dec 20 '11 at 07:13
  • U+0092 is a never-used control character. It is almost always the result of misdecoding a single right quote `’` in a Windows code page 1252 file as ISO-8859-1. The encodings are very similar, but the characters encoded in the byte range 0x80–0x9F are different. In this case you shouldn't get rid of it or the other smart quote characters; you should just read them correctly as Windows-1252, or transcode the file from 1252 to UTF-8. – bobince Dec 21 '11 at 21:06
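The mix-up between Windows-1252 and ISO-8859-1 that the last comment describes can be reproduced in a few lines of Python 3. This is a sketch with a made-up sample byte string; `repair` is a hypothetical helper, not something from the thread.

```python
# Sketch: the byte 0x92 decoded two ways. The sample bytes are hypothetical.
raw = b"it\x92s"

wrong = raw.decode("iso-8859-1")  # 0x92 becomes the control character U+0092
right = raw.decode("cp1252")      # 0x92 becomes U+2019, the right single quote


def repair(text: str) -> str:
    """Undo a mistaken ISO-8859-1 decode of Windows-1252 bytes.

    Raises UnicodeDecodeError for the few bytes (such as 0x81) that
    Python's cp1252 codec leaves undefined.
    """
    return text.encode("iso-8859-1").decode("cp1252")


if __name__ == "__main__":
    print(repr(wrong), repr(right), repr(repair(wrong)))
```

If the files really are Windows-1252, decoding them as `cp1252` and writing them back as UTF-8 keeps the quotes instead of deleting them.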

2 Answers

3

Try sed "s/\`//g" *. (I added the g so it will remove all the backticks it finds).


EDIT: It's not a backtick that OP wants to remove.

Following the solution in this question, this ought to work:

sed 's/\xc2\x92//g'

To demonstrate it does:

$ CHARS=$(python2 -c 'print u"asdf\u0092asdf".encode("utf8")')

$ echo "$CHARS"
asdf<funny glyph symbol>asdf

$ echo "$CHARS" | sed 's/\xc2\x92//g'
asdfasdf

Seeing as it's something you tried already, perhaps what is in your text file is not U+0092?
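One way to check what is actually in the file is to count the raw bytes. The following Python 3 sketch is only a diagnostic aid; `count_forms` and its dictionary keys are names invented for this example.

```python
# Sketch: tell a UTF-8 encoded U+0092 (bytes 0xC2 0x92) apart from any
# other 0x92 byte. count_forms is a hypothetical helper name.
def count_forms(data: bytes) -> dict:
    """Count UTF-8 encoded U+0092 and all remaining 0x92 bytes."""
    utf8 = data.count(b"\xc2\x92")
    # Every 0xC2 0x92 pair contains one 0x92, so subtract those.
    # What remains is either a lone 0x92 (e.g. Windows-1252 data) or a
    # continuation byte inside some other multi-byte UTF-8 sequence.
    other = data.count(b"\x92") - utf8
    return {"utf8_u0092": utf8, "other_0x92": other}


if __name__ == "__main__":
    print(count_forms(b"a\xc2\x92b\x92c"))
```

To inspect a real file, read it in binary mode, for example `count_forms(open(path, "rb").read())`, where `path` is whichever file you want to check.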

Pablo Bianchi
mathematical.coffee
  • Ahh, I see. In that case, have a look at this solution: http://stackoverflow.com/questions/8562354/remove-unicode-characters-from-textfiles-sed-other-bash-shell-methods/8562661#8562661 – mathematical.coffee Dec 20 '11 at 07:25
  • It's weird. `sed 's/\xc2\x92//g'` didn't work, but `CHARS=$(python -c 'print u"\u0092".encode("utf8")'); sed 's/['"$CHARS"']//g'` works fine. Since \u0092 and \xc2\x92 should be the same character, I'm not sure why one works but the other doesn't. – alvas Dec 20 '11 at 07:44
  • That's curious, if you ever figure out why I'm interested to know! – mathematical.coffee Dec 20 '11 at 23:11
1

This might work for you (GNU sed):

echo "string containing funny character(s)" | sed -n 'l0'

This displays the string as sed sees it, with non-printable bytes shown as octal escapes. Then use:

echo "string containing funny character(s)" | sed 's/\onnn//g'

where nnn is the octal value shown by the first command, to delete the character(s).
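To get a feel for which octal values to expect, here is a rough Python 3 imitation of that octal view. It is a simplified sketch and not a reimplementation of sed's `l` command (for instance, it also writes a backslash as an octal escape); `octal_view` is a hypothetical name.

```python
# Sketch: show non-printable and non-ASCII bytes as \nnn octal escapes.
# octal_view is a hypothetical helper, only loosely modelled on sed's l.
def octal_view(data: bytes) -> str:
    """Return printable ASCII as-is and every other byte as \\nnn."""
    parts = []
    for byte in data:
        if 0x20 <= byte < 0x7F and byte != 0x5C:  # 0x5C is the backslash
            parts.append(chr(byte))
        else:
            parts.append("\\%03o" % byte)
    return "".join(parts)


if __name__ == "__main__":
    print(octal_view("a\u0092b".encode("utf-8")))
```

For a UTF-8 encoded U+0092 this shows two escapes, `\302` and `\222`, one per byte.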

potong