I want to convert a Windows UTF-8 file containing a special apostrophe to a Unix ISO-8859-1 file. This is how I am doing it:

# -- unix file
tr -d '\015' < my_utf8_file.xml > t_my_utf8_file.xml

# -- get rid of special apostrophe
sed "s/’/'/g" t_my_utf8_file.xml > temp_my_utf8_file.xml

#  -- change the xml header
sed "s/UTF-8/ISO-8859-1/g" temp_my_utf8_file.xml > my_utf8_file_temp.xml

# -- the actual character set conversion
iconv -c -f UTF-8 -t ISO8859-1 my_utf8_file_temp.xml > my_file.xml

Everything works except for one thing in one of my files. There seems to be an invisible character at the beginning of the original file. When I open my_file.xml in Notepad++, I see a SUB at the beginning of the file. In vi on Unix, I see ^Z.

What should I add to my Unix script, and where, to delete this kind of character?

Thank you

Henk Langeveld
mlwacosmos
  • When your only problem is the first char, you can `sed '1 s/.//'`. – Walter A Jun 28 '17 at 21:35
  • @mlwacosmos, You are seeing the [byte order mark (BOM)](https://en.wikipedia.org/wiki/Byte_order_mark). There are multiple ways to remove it, many are addressed in [this answer](https://stackoverflow.com/questions/1068650/using-awk-to-remove-the-byte-order-mark). – randomir Jun 30 '17 at 19:59

1 Answer

To figure out exactly which character(s) you're dealing with, isolate the line in question (in this case something simple like `head -1 <file>` should suffice) and pipe the result to `od`, using the appropriate flag to display the bytes in the desired format:

head -1 <file> | od -c   # view as characters (non-printable bytes shown in octal)
head -1 <file> | od -d   # view as decimal (2-byte words)
head -1 <file> | od -o   # view as octal (2-byte words)
head -1 <file> | od -x   # view as hex (2-byte words)

Once you know the character(s) you're dealing with, you can use your favorite command (e.g., tr, sed) to remove them.
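As a rough sketch of that last step, assuming the bytes turn out to be the UTF-8 byte order mark (which `od -c` prints as 357 273 277, i.e. hex EF BB BF), something like the following could strip it. The file names and the `bom` variable are made up for the demo; the sample file is generated with `printf` so the snippet runs on its own.

```shell
# Hypothetical demo file: a UTF-8 BOM (octal 357 273 277) followed by some XML.
tmpdir=$(mktemp -d)
printf '\357\273\277<?xml version="1.0" encoding="UTF-8"?>\n<a>caf\303\251</a>\n' > "$tmpdir/in.xml"

# Build the three BOM bytes with printf so no sed-specific escape syntax is needed.
bom=$(printf '\357\273\277')

# Remove the BOM only at the very start of line 1.
# LC_ALL=C makes sed match raw bytes instead of decoding characters.
LC_ALL=C sed "1s/^$bom//" "$tmpdir/in.xml" > "$tmpdir/out.xml"

# Inspect the first three bytes again: they should now be < ? x, not 357 273 277.
head -c 3 "$tmpdir/out.xml" | od -c
```

In a script like the one in the question, this would presumably go before the `iconv` call, while the data is still UTF-8. Anchoring the match to the start of the first line is a deliberate choice: `tr -d '\357\273\277'` deletes each of those three bytes wherever it occurs, and the same byte values can be part of other multi-byte UTF-8 characters further down the file.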

markp-fuso
  • Using `head -1 | od -c`, the invisible character is: 357 273 277 – mlwacosmos Jun 29 '17 at 06:42
  • you're already using `tr` to remove a character, so add these characters to that `tr` command; alternatively, do something like `sed 's/\xNN\xNN\xNN//'`, where each `NN` is one of the hex codes output by `od -x`. – markp-fuso Jun 29 '17 at 12:47