
While helping someone debug some code, I noticed some odd characters in their output, namely � and � (\xc0 and \xd0 in hex).

I wanted to find these characters in a large text output file.

I've managed to locate these characters using sublime by enabling the regex option in find with \xc0 or \xd0 being the query. I have also managed to grep them by doing grep $'\xc0' filename in bash.

The thing that is bothering me right now is that, if I use the -P option for grep, it refuses to find these characters.

grep -P "\xc0" filename prints nothing for a file that contains that character, while the other two methods above find it. This is bugging me, and I want to know why it doesn't work.

I have read a couple of other posts in which the -P option along with "[\x80-\xff]" are suggested but for some reason I just couldn't get them to work :\

grep -P has been a good friend for a long time until now :( Any help and tips are appreciated!

I'm using GNU grep.

EDIT:

I have actually tried on 2 linux distributions.

  • On Ubuntu 14.04 with bash: My terminal doesn't seem to like the character :\

printf "\xc0" prints out nothing in the terminal, however printing it to a file with > and then opening in sublime would show the character.

printf "\xc0" > foo
grep $'\xc0' foo > out1
grep -P '\xc0' foo > out2
grep -P '\x{c0}' foo > out3

out{1,2,3} are all empty.

  • On CentOS 7.2 with bash: printf prints something - the question mark dark thingy

printf "\xc0" prints out �(actually looks like this)

printf "\xc0" > foo
grep $'\xc0' foo > out1
grep -P '\xc0' foo > out2
grep -P '\x{c0}' foo > out3

Only out1 contains the character.

a283626086

1 Answer

byte

The first thing to do is to store in a variable the exact byte that you want to search for.

Any of these will do:

a=$(echo -e '\xc0')
a=$'\xc0'
a=$(printf '\xc0')
a=$(echo -e '\300')     # 300 is 0xC0 in octal
a=$'\300'
a=$(printf '\300')
a=$(echo "c0" | xxd -r -p)

I could try to come up with some other ways, but I hope you get the idea.

Then, you could try to search for the byte with grep:

echo $'Testing this: \xC0 byte' |  grep "$a"

If you use a UTF-8 locale (the most common case), that will fail. If you switch to an ISO-8859-1 locale, it will work:

LC_ALL=en_US.iso88591 echo $'Testing this: \xC0 byte' |
LC_ALL=en_US.iso88591  grep -P "$a"

Or, if you don't mind starting a new bash instance:

$ bash
$ export LC_ALL=en_US.iso88591
$ echo $'Testing this: \xC0 byte' |  grep -P "$a"

Then return to the old bash environment by executing exit.
Whether this works depends on your system, in particular on whether that locale is installed.
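
As a minimal sketch of a variant that does not depend on an ISO-8859-1 locale being installed: the C locale treats input as raw bytes, so the byte can be matched directly. The variable names are only illustrative, and octal \300 is used so the snippet also runs in shells whose printf lacks \x escapes.

```shell
# Build the single byte 0xC0 (octal 300) portably.
needle=$(printf '\300')

# Count matching lines; in the C locale grep compares raw bytes.
count=$(printf 'Testing this: \300 byte\n' | LC_ALL=C grep -c "$needle")
echo "lines containing byte 0xC0: $count"
```

With a GNU grep built with PCRE support, LC_ALL=C grep -P '\xc0' should behave the same way, since the pattern is then matched against bytes rather than UTF-8 characters.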

Let's explore the other side: characters.

character

There is a very important twist that you should understand.
A byte is not a character. Well, sometimes, by sheer luck, it is.

Apart from the 128 ASCII characters, for which one byte is one character (though not in UTF-16 or UTF-32, and let's also forget about EBCDIC), every other one of the 1,114,112 (17 × 65,536) Unicode code points is encoded as more than one byte.

In that case, you should ask for the UNICODE code point of hex 0xC0.
In modern bash, like this:

$ printf '\U00C0'
À

Which is this character: LATIN CAPITAL LETTER A WITH GRAVE

That will be encoded as one byte if the locale is ISO-8859-1 (and ISO-8859-15, at least) and as two bytes if the locale is utf-8.
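
To make the one-byte versus two-byte difference concrete, here is a small sketch that dumps the bytes of À in both encodings. It assumes od and iconv are available; the variable names are illustrative, and the UTF-8 bytes are written in octal so the result does not depend on the current locale.

```shell
# U+00C0 encoded as UTF-8 is the two bytes C3 80 (octal 303 200).
utf8_hex=$(printf '\303\200' | od -An -tx1 | tr -d ' \n')

# Converted to ISO-8859-1, the same character is the single byte C0.
latin1_hex=$(printf '\303\200' | iconv -f UTF-8 -t ISO-8859-1 | od -An -tx1 | tr -d ' \n')

echo "UTF-8: $utf8_hex  ISO-8859-1: $latin1_hex"
```

So a pattern that means "the character U+00C0" and a pattern that means "the byte 0xC0" only coincide in a single-byte locale.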

$ a=$(printf '\UC0')
$ printf 'Testing \U00C0 character' | grep -P "$a"
Testing À character

It will also work if you change the LC_ALL variable. That is, grep will detect the character, but the printed line may not render it correctly because of the changed locale.

If the file contains this character and the file's encoding is correct, grep will work with the value of the character in a variable.
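
If it is unclear which of the two forms a file actually contains, one way to check is to count both byte patterns in the C locale. This is only a sketch: the temporary file stands in for the real output file, and the variable names are hypothetical.

```shell
# Sample file: line 1 holds a raw 0xC0 byte, line 2 holds U+00C0 as UTF-8.
tmp=$(mktemp)
printf 'raw byte: \300\nutf-8 char: \303\200\n' > "$tmp"

raw=$(printf '\300')        # lone byte C0
utf8=$(printf '\303\200')   # bytes C3 80

# -F treats the pattern as a fixed string; LC_ALL=C makes grep byte-oriented.
raw_lines=$(LC_ALL=C grep -cF "$raw" "$tmp")
utf8_lines=$(LC_ALL=C grep -cF "$utf8" "$tmp")
rm -f "$tmp"

echo "lines with raw C0: $raw_lines, lines with UTF-8 C3 80: $utf8_lines"
```

A non-zero count for the raw byte in a file that is supposed to be UTF-8 suggests the file is not validly encoded, which is the situation described in the question.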

  • Thank you so much for the explanation! I always got confused with character encodings and all this. On my Ubuntu machine `printf '\U00c0'` indeed prints that character to my terminal now. So `printf '\xc0'` not displaying was because of the locale of the shell(I was able to see it in sublime if I printed to file)? – a283626086 Nov 27 '16 at 19:31
  • One of the main reasons I was confused was that I could use the `\xc0` regex expression to search for the character in sublime, while I can't seem to do it with `grep`'s `-P` option, as seen in many other posts I have seen, for example [this one](http://stackoverflow.com/questions/23695609/how-to-grep-for-presence-of-specific-hex-bytes-in-files) and [this one](http://unix.stackexchange.com/questions/19491/how-to-specify-characters-using-hexadecimal-codes-in-grep). – a283626086 Nov 27 '16 at 19:33
  • @a283626086 You can use `\xc0` in sublime because it is **assuming** a specific code page (only 256 characters), probably ISO-8859-1 (in USA) or ISO-8859-5 (in Russia) or ISO-8859-7 (Greece). In that limited charset, the byte C0 means a specific character: À, Р or ΐ (respectively, for the code pages above). But that also means that the character can change when the code page changes. That sublime chooses a single character set is just a limitation it has. UTF-8 breaks that limit. Embrace UTF-8 and be free to write any character. –  Nov 30 '16 at 00:59