In Bash, how to convert only extended ASCII chars to their hex codes?

Question

I need to check a string variable for extended ASCII characters (single bytes with decimal codes 128-255). If any are present, each one should be replaced with its multi-character hex escape, so that the string is ready for a subsequent grep command etc.

Example: the string "Ørsted\ Salg" should be converted to "\xD8rsted\ Salg".

I know how to do it with a hash table in Bash 4:

declare -A symbolHashTable=(
    ["Ø"]="D8"
);
currSearchTerm="Ørsted\ Salg"
for curRow in "${!symbolHashTable[@]}"; do
    currSearchTerm=$(echo "$currSearchTerm" | sed "s/$curRow/\\\\x${symbolHashTable[$curRow]}/g")
done

This works, but it seems too tedious for all 128 characters. There should be a shorter and probably faster way that does not require listing every symbol.

I can detect whether the string has any of the characters in it with:

echo "$currSearchTerm" | grep -P "[\x80-\xFF]"

I am almost sure there is a way to make sed do it, but I get lost somewhere in the "replace with" part.

Why do you need to do this? For grep commands? grep doesn't need any of this. Are you just trying to find this text in an iso8859-15 encoded text file? — that other guy, Mar 21 '18 at 20:07
Which Extended ASCII character set and encoding did you have in mind? Is the source in the same encoding or could it be the UTF-8 encoding of the Unicode character set? — Tom Blodget, Mar 21 '18 at 20:10
Yes, it is a grep on an ISO 8859-1 file, which can be multiple GB in size. I tried iconv, but that is probably not necessary, and it didn't work in a script, only as a hand-entered command. I also tried LANG=C, but that works the same as without it: extended characters are not matched. There is no UTF-8, only single-byte characters. By extended set I mean extended ASCII, as listed at https://www.ascii-code.com/ — uldics, Mar 21 '18 at 20:17
Okay, ISO 8859-1. (There are so many "Extended ASCII" character sets that the term often doesn't communicate what needs to be communicated.) — Tom Blodget, Mar 22 '18 at 03:38
So far I have done it by converting all characters to the \x5e format with a sed-based for loop, using printf to concatenate the result back together and add the extra \x symbols. It also needed a hacky correction for the space, which for some reason was converted to \x0 instead of \x20. Based on this post: https://stackoverflow.com/a/27211176/4255834 — uldics, Mar 22 '18 at 06:51

score 2 · Answer 1 · answered Mar 21 '18 at 20:55

You can easily do this with Perl:

#!/bin/bash
original='Ørsted'
replaced=$(perl -pe 's/([\x80-\xFF])/"\\x".unpack "H*", $1/eg' <<< "$original")

echo "The original variable's hex encoding is:"
od -t x1 <<< "$original"

echo "Therefore I converted $original into $replaced"

Here's the output when the file and terminal are ISO-8859-1:

The original variable's hex encoding is:
0000000 d8 72 73 74 65 64 0a
0000007
Therefore I converted Ørsted into \xd8rsted

Here's the output when the file and terminal are UTF-8:

The original variable's hex encoding is:
0000000 c3 98 72 73 74 65 64 0a
0000010
Therefore I converted Ørsted into \xc3\x98rsted

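In both cases it works as expected: every byte in the range 0x80-0xFF is replaced by its \xHH escape, so the two-byte UTF-8 encoding of Ø yields two escapes.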
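
If Perl is not available, the same byte-wise replacement can probably be done with Bash and od alone. The following is only a minimal sketch, not part of the original answer: the function name hex_escape_high_bytes is made up for illustration, it assumes Bash 4+ (for ${hex^^}) and an od that supports -An -v -tx1, and it emits uppercase hex as in the question's example.

```shell
#!/bin/bash
# Hypothetical helper (not from the answer): escape every byte in the
# range 0x80-0xFF as \xHH and leave plain ASCII bytes untouched.
hex_escape_high_bytes() {
    local hex char out=''
    # od dumps each byte of the argument as a two-digit hex number.
    for hex in $(printf '%s' "$1" | od -An -v -tx1); do
        if (( 16#$hex >= 0x80 )); then
            out+="\\x${hex^^}"               # high byte: append literal \xHH
        else
            printf -v char '%b' "\\x$hex"    # ASCII byte: decode hex back to the character
            out+=$char
        fi
    done
    printf '%s\n' "$out"
}

# $'\xD8' is the single ISO-8859-1 byte for Ø, independent of the script's encoding.
hex_escape_high_bytes $'\xD8rsted\\ Salg'    # prints: \xD8rsted\ Salg
```

Because od works on raw bytes, the result should not depend on the current locale. It forks od once per call, so it is meant for short search terms rather than bulk data. The escaped term could then be handed to something like LC_ALL=C grep -P on the ISO-8859-1 file, though that part is untested here.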
