I need to check whether a string variable contains extended ASCII characters (single bytes with decimal codes 128-255). If it does, each such character should be replaced with its multi-character hex escape, so the result is ready for a later grep command etc.

Example: the string "Ørsted\ Salg" should be converted to "\xD8rsted\ Salg".

I know how to do it with a hash table (associative array) in Bash 4:

declare -A symbolHashTable=(
    ["Ø"]="D8"
)
currSearchTerm='Ørsted\ Salg'
for curRow in "${!symbolHashTable[@]}"; do
    # Quote the expansions and replace every occurrence (g), not only the first
    currSearchTerm=$(printf '%s\n' "$currSearchTerm" | sed "s/$curRow/\\\\x${symbolHashTable[$curRow]}/g")
done

But that seems too tedious for 128 cases (codes 128-255). There should be a shorter and probably faster way to do it, without listing every symbol.

I can detect whether the string has any of the characters in it with:

printf '%s\n' "$currSearchTerm" | grep -P "[\x80-\xFF]"
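
For use as a condition in a script, the same check could be sketched without grep -P. This swaps in a different technique: tr deletes every 7-bit byte, so anything left over must be in the 128-255 range. The function name has_high_bytes and the sample variable are made up for illustration, and the sketch assumes the string holds single-byte (ISO 8859-1 style) data.

```shell
#!/bin/bash
# Hypothetical helper (not from the original post): succeeds when the
# argument contains at least one byte in the range 128-255.
has_high_bytes() {
    # LC_ALL=C makes tr work on raw bytes; \000-\177 is the 7-bit range in octal.
    [ -n "$(printf '%s' "$1" | LC_ALL=C tr -d '\000-\177')" ]
}

# Build the sample from an octal escape (\330 is byte 0xD8, "Ø" in ISO 8859-1)
# so the demo does not depend on the encoding of this script file.
sample=$(printf '\330rsted\\ Salg')

if has_high_bytes "$sample"; then
    echo "string contains extended characters"
fi
```

The exit status can then decide whether the conversion step needs to run at all.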

I am almost sure there is a way to make sed do it, but I get lost somewhere in the "replace with" part.

uldics
  • Why do you need to do this? For grep commands? grep doesn't need any of this. Are you just trying to find this text in an iso8859-15 encoded text file? – that other guy Mar 21 '18 at 20:07
  • Which Extended ASCII character set and encoding did you have in mind? Is the source in the same encoding or could it be the UTF-8 encoding of the Unicode character set? – Tom Blodget Mar 21 '18 at 20:10
  • Yes, it is grep on an iso8859-1 file, which can be multiple GB in size. I tried iconv, but that is probably not necessary, and it didn't work in the script, only in a hand-entered command. I also tried LANG=C, but that works the same as without it: the extended characters are not matched. There is no UTF-8, only single-byte chars. By "extended set" I mean extended ASCII as listed on https://www.ascii-code.com/ – uldics Mar 21 '18 at 20:17
  • Okay, ISO 8859-1. (There are so many "Extended ASCII" character sets that the term often doesn't communicate what needs to be communicated.) – Tom Blodget Mar 22 '18 at 03:38
  • So far I have done it by converting all characters to the \x5e format with a sed-based for loop, using printf to concatenate the result back and add the extra \x symbols. Plus a hacky correction for the space, which for some reason did not convert to \x20 but just to \x0. Based on this post: https://stackoverflow.com/a/27211176/4255834 – uldics Mar 22 '18 at 06:51

1 Answer

You can easily do this with Perl:

#!/bin/bash
original='Ørsted'
replaced=$(perl -pe 's/([\x80-\xFF])/"\\x".unpack "H*", $1/eg' <<< "$original")

echo "The original variable's hex encoding is:"
od -t x1 <<< "$original"

echo "Therefore I converted $original into $replaced"

Here's the output when the file and terminal are ISO-8859-1:

The original variable's hex encoding is:
0000000 d8 72 73 74 65 64 0a
0000007
Therefore I converted Ørsted into \xd8rsted

Here's the output when the file and terminal are UTF-8:

The original variable's hex encoding is:
0000000 c3 98 72 73 74 65 64 0a
0000010
Therefore I converted Ørsted into \xc3\x98rsted

In both cases it works as expected.

that other guy