How do I get rid of this unicode character?

Question

Any idea how to get rid of the irritating character U+0092 in a bunch of text files? I've tried all of the commands below, but none of them works. The character map calls it "U+0092+control".

sed -i 's/\xc2\x92//' *
sed -i 's/\u0092//' *
sed -i 's///' *

Ah, I've found a way:

CHARS=$(python2 -c 'print u"\u0092".encode("utf8")')
sed 's/['"$CHARS"']//g'

But is there a direct sed method for this?
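
For comparison, the workaround above can be sketched in Python 3 without sed. This is only an illustration, not a direct sed method: the names `TARGET`, `strip_u0092` and `strip_u0092_in_file` are made up here, and the sketch assumes the files are UTF-8 encoded, so that U+0092 is stored as the two bytes 0xC2 0x92.

```python
from pathlib import Path

# UTF-8 encoding of U+0092: the same two bytes that \xc2\x92 denotes.
TARGET = "\u0092".encode("utf-8")


def strip_u0092(data: bytes) -> bytes:
    """Return data with every UTF-8 encoded U+0092 removed."""
    return data.replace(TARGET, b"")


def strip_u0092_in_file(path: str) -> None:
    """Rewrite one file in place (hypothetical helper; assumes UTF-8 content)."""
    p = Path(path)
    p.write_bytes(strip_u0092(p.read_bytes()))
```

Working on bytes avoids decoding errors if a file turns out not to be valid UTF-8; in that case nothing is removed and the file is left unchanged.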

Single quotes will stop your shell from parsing any escaped notation for backtick. I'm not sure sed would do this itself, so maybe try double quotes? – Ulrich Schwarz, Dec 20 '11 at 06:43
This one is tricky. It's some sort of non-breaking space: U+0092 appears in the text but is not visible. – alvas, Dec 20 '11 at 07:13
U+0092 is a never-used control character. It is almost always the result of misdecoding a single right quote `’` in a Windows code page 1252 file as ISO-8859-1. The encodings are very similar, but the characters encoded in the byte range 0x80–0x9F are different. In this case you shouldn't get rid of it or the other smart quote characters; you should just read them correctly as Windows-1252, or transcode the file from 1252 to UTF-8. – bobince, Dec 21 '11 at 21:06
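
The mix-up described in the last comment can be reproduced in Python 3. What follows is a minimal sketch, assuming the text was decoded as ISO-8859-1 when the bytes were really Windows-1252; `fix_c1_controls` is a hypothetical helper name, not something from the thread.

```python
def fix_c1_controls(text: str) -> str:
    """Map C1 controls (U+0080..U+009F) to the characters that the same
    byte values encode in Windows-1252, e.g. U+0092 -> U+2019."""
    out = []
    for ch in text:
        code = ord(ch)
        if 0x80 <= code <= 0x9F:
            try:
                out.append(bytes([code]).decode("cp1252"))
            except UnicodeDecodeError:
                # A few byte values are unassigned in Python's cp1252
                # codec; leave those characters untouched.
                out.append(ch)
        else:
            out.append(ch)
    return "".join(out)


# The same byte read two ways:
wrong = b"\x92".decode("latin-1")  # U+0092, the invisible control character
right = b"\x92".decode("cp1252")   # U+2019, right single quotation mark
```

Repairing the characters this way keeps the apostrophes that deleting U+0092 would lose.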

score 3 · Accepted Answer · edited Jul 18 '19 at 06:01

Try sed "s/\`//g" *. (I added the g so it will remove all the backticks it finds).

EDIT: It's not a backtick that OP wants to remove.

Following the solution in this question, this ought to work:

sed 's/\xc2\x92//g'

To demonstrate it does:

$ CHARS=$(python -c 'print u"asdf\u0092asdf".encode("utf8")')

$ echo $CHARS
asdf<funny glyph symbol>asdf

$ echo $CHARS | sed 's/\xc2\x92//g'
asdfasdf

Seeing as it's something you tried already, perhaps what is in your text file is not U+0092?
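
One way to check what the file really contains is to count its non-ASCII code points. Below is a rough Python 3 sketch; `non_ascii_report` is an illustrative name, and the sketch assumes the file is meant to be UTF-8. Bytes that cannot be decoded show up as U+FFFD, which would suggest the file holds a raw 0x92 byte (for example from Windows-1252) and not the UTF-8 pair 0xC2 0x92, in which case a pattern like `\xc2\x92` has nothing to match.

```python
from collections import Counter


def non_ascii_report(data: bytes) -> dict:
    """Count each non-ASCII code point in data, keyed as 'U+XXXX'.

    Bytes that are not valid UTF-8 are decoded as U+FFFD.
    """
    text = data.decode("utf-8", errors="replace")
    counts = Counter(ch for ch in text if ord(ch) > 0x7F)
    return {f"U+{ord(ch):04X}": n for ch, n in counts.items()}
```

For example, `non_ascii_report(open("some_file.txt", "rb").read())` (with a file name of your own) lists every non-ASCII character in that file together with how often it occurs.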

answered Dec 20 '11 at 06:56 by mathematical.coffee · edited Jul 18 '19 at 06:01 by Pablo Bianchi

Ahh, I see. In that case, have a look at this solution: http://stackoverflow.com/questions/8562354/remove-unicode-characters-from-textfiles-sed-other-bash-shell-methods/8562661#8562661 – mathematical.coffee Dec 20 '11 at 07:25
It's a weird thing. `sed 's/\xc2\x92//g'` didn't work, but `CHARS=$(python -c 'print u"\u0092".encode("utf8")'); sed 's/['"$CHARS"']//g'` works fine. Since \u0092 and \xc2\x92 should be the same character, I'm not sure why one works but the other doesn't. – alvas Dec 20 '11 at 07:44
That's curious, if you ever figure out why I'm interested to know! – mathematical.coffee Dec 20 '11 at 23:11

score 1 · Answer 2 · answered Dec 20 '11 at 11:02

This might work for you (GNU sed):

echo "string containing funny character(s)" | sed -n 'l0'

This will display the string as sed sees it, with non-printable bytes shown as octal escapes. Then use:

echo "string containing funny character(s)" | sed 's/\onnn//g'

where nnn is the octal value of the byte to delete (repeat \onnn for each byte of a multi-byte character).
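
To illustrate the octal view outside sed, here is a small Python 3 sketch that renders non-printable bytes as octal escapes. It only loosely imitates what `l` prints (it does not reproduce sed's exact output, such as the trailing `$`), and `octal_view` is a hypothetical name.

```python
def octal_view(data: bytes) -> str:
    """Show printable ASCII as-is and every other byte as \\ooo (octal)."""
    parts = []
    for b in data:
        if b == 0x5C:
            # Escape the backslash itself so the output stays unambiguous.
            parts.append("\\\\")
        elif 0x20 <= b < 0x7F:
            parts.append(chr(b))
        else:
            parts.append(f"\\{b:03o}")
    return "".join(parts)


# UTF-8 encoded U+0092 is the byte pair 0xC2 0x92, i.e. octal 302 222.
sample = "asdf\u0092asdf".encode("utf-8")
view = octal_view(sample)
```

Here `view` is the string `asdf\302\222asdf`, so under the assumption of UTF-8 input the octal values to plug into the sed command would be 302 and 222.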
