I am using a UTF-8 file as input to a script that does some text processing; the script then prints the output I want, and that output is again UTF-8. However, when I run sort -u from the terminal on that UTF-8 output file, the redirected output of sort is reported as charset=unknown-8bit. What can I do to solve this? Why does my script work perfectly while the terminal changes everything? (Mac OS X)

Feel free to ask for any details you need. Thanks!

little_mice
  • What's your locale setting? Can you post a hexdump of the output and indicate how it's different from the expected one? – choroba May 17 '18 at 11:51
  • Hello choroba! Could you explain to me how I can get the hexdump? – little_mice May 17 '18 at 13:30
  • I do not know whether this is helpful, but something really weird is happening. When I check the format of the input file, which is surely UTF-8, while connected to a remote server, file -I displays charset=utf-8. But when I copy that file and repeat the same command offline, it shows charset=unknown-8bit. Isn't this really strange? I would say that my computer is the problem here, but I have no idea how to solve this – little_mice May 17 '18 at 14:10
  • Maybe you just have a different version of the file utility. Try `hexdump -C file` to get a hexdump. On Linux, if the command doesn't exist, you can try `xxd` or `od` as alternatives. – choroba May 17 '18 at 15:11
  • LC_ALL=C sort < your-file.txt – webmite May 17 '18 at 15:36
  • Use `od -cab your_inputfile > ifile.out` and `od -cab outputfile_or_data > ofile.out`, then run `sdiff -w 200 ifile.out ofile.out` to see which character is being introduced. There's also a command, `iconv -f your_fromFormat -t your_toFormat < file > resultant_to_FormatFile`, that you can try. You should also run `shasum yourfile` to compare the checksums of the files. Setting `LC_ALL=C` makes the `sort` command on Linux order its output the same way as the OS X and FreeBSD variant. https://stackoverflow.com/questions/27395317/why-does-utf-8-text-sort-in-different-order-between-os-x-and-linux – AKS May 17 '18 at 22:15
  • `iconv -f unknown-8bit -t utf-8 < yourOutPutFileContainingOutputData > yourOutPutFileContainingOutputData_inUTF-8_format` see if this helps. https://superuser.com/questions/151981/converting-the-encoding-of-a-text-file-mac-os-x – AKS May 17 '18 at 22:19

1 Answer

Consider adding

export LC_ALL=C

to your .bashrc (or the equivalent for other shells).
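If you would rather not change the locale for every program in your session, the variable can also be set for a single command by prefixing it. The following is only a minimal sketch: the file names `sample_utf8.txt` and `sorted.txt` and the sample words are made up for illustration, and whether this resolves the charset=unknown-8bit report on your machine is not something confirmed in this thread. With `LC_ALL=C`, sort compares raw bytes instead of interpreting characters according to the locale.

```shell
# Hypothetical sample file: three distinct words, one duplicated.
# \303\251 is the UTF-8 byte sequence (C3 A9) for "é".
printf 'zebra\n\303\251clair\napple\n\303\251clair\n' > sample_utf8.txt

# Set the C locale for this one command only; the rest of the
# shell session keeps its normal locale settings.
LC_ALL=C sort -u sample_utf8.txt > sorted.txt

# Dump the bytes of the result to confirm the C3 A9 sequence is intact.
# On macOS, `file -I sorted.txt` (as used in the question) can then be
# run to see which charset is reported.
od -An -tx1 sorted.txt
```

In byte order the non-ASCII word sorts after the ASCII ones, so the result may be ordered differently from what a UTF-8 locale would produce.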

Vlad K.