
I downloaded a file ('CRS 2013 data.txt') from the OECD at http://stats.oecd.org/Index.aspx?datasetcode=CRS1 by selecting Export -> Related files. I want to work with this file on Ubuntu (14.04 LTS).

When I run:

dos2unix CRS\ 2013\ data.txt

I see:

dos2unix: Binary symbol 0x0004 found at line 1703
dos2unix: Skipping binary file CRS 2013 data.txt
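As a side note, the raw bytes of the line dos2unix complains about can be dumped to see what it is objecting to. Something like the following should work (just a sketch; sed counts lines by newline bytes, so on a UTF-16 file the line number is only approximate):

LC_ALL=C sed -n '1703p' CRS\ 2013\ data.txt | od -An -c | head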

I check the encoding of the file with:

file --mime-encoding CRS\ 2013\ data.txt

and see:

CRS 2013 data.txt: utf-16le
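Whether the file really starts with a little-endian BOM can be checked by dumping its first two bytes; ff fe would mean the BOM is present (again just a sketch):

head -c 2 CRS\ 2013\ data.txt | od -An -tx1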

I do:

iconv -l | grep utf-16le

which doesn't return anything, so I try:

iconv -l | grep UTF-16LE

which returns:

UTF-16LE//

Then I run:

iconv --verbose -f UTF-16LE -t UTF-8 CRS\ 2013\ data.txt -o crs_2013_data_temp.txt

and check:

file --mime-encoding crs_2013_data_temp.txt

and see:

crs_2013_data_temp.txt: utf-8
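A quick way to confirm that the DOS line endings survived the iconv step is to count the lines that still end in a carriage return (a sanity check only; the $'\r' quoting assumes a bash-like shell):

grep -c $'\r$' crs_2013_data_temp.txt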

Then I try:

dos2unix crs_2013_data_temp.txt

and get:

dos2unix: Binary symbol 0x04 found at line 1703
dos2unix: Skipping binary file crs_2013_data_temp.txt

I then try to force it:

dos2unix -f crs_2013_data_temp.txt

It works, i.e. dos2unix completes the conversion without bailing out or complaining, but when I open the file I see entries like "FoÄŤa and ÄŚajniÄŤe".

My question is: why? Is it because the BOM is not visible to dos2unix? Because it's missing? Have I not done the conversion right? How do I convert this file correctly so that I can read it?

dw8547

3 Answers


That 0x0004 character you are seeing in your file has nothing at all to do with the BOM (which is fine, by the way) -- it's an EOT (End of Transmission) character from the C0 control set, and has been at that codepoint since 7-bit ASCII was the new hotness. (It's also the familiar Control-D Unix EOF sequence.)
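If you want to see exactly where those EOT bytes live, grep can list the offending lines once the file is in UTF-8 -- a sketch, using the intermediate file from your iconv step:

LC_ALL=C grep -an $'\x04' crs_2013_data_temp.txt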

Unfortunately, the old pre-dos2unix trick of running tr over the file to strip the carriage returns won't work directly, because the file is UTF-16. Since iconv works for you, though, you can use it to convert to UTF-8 (which tr can handle) and then run this tr command:

tr -d '\r' < crs_2013_data_temp.txt > crs_2013_data_unix.txt

in order to get the text file into the Unix line ending convention. You will have to keep an eye on whatever tools you're feeding the file to, though, to make sure that they don't choke on the Ctrl-D/EOT character; if they do, you can use

tr -d '\004' < crs_2013_data_unix.txt > crs_2013_data_clean.txt

to get rid of it.
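If you prefer, the conversion and both deletions can be chained into a single pipeline (a sketch, reusing the filenames from the question):

iconv -f UTF-16LE -t UTF-8 CRS\ 2013\ data.txt | tr -d '\r\004' > crs_2013_data_clean.txt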

As to how it got there in the first place? I blame the Belgians for letting it sneak into the data they gave the OECD, which they probably keyed in with cat - > file or some other similarly underwhelming means. Also, some text editors try to be a bit too helpful by hiding control characters, even though other tools will bail out when they see them, assuming you have just handed them a binary file that had been masquerading as text.

LThode
  • How do I/you know that the BOM is OK? Is it because: `file --mime-encoding CRS\ 2013\ data.txt` returns `utf-16le` and `dos2unix` attempts to convert the file until it finds the first binary symbol and `dos2unix` can only detect if a file is in the UTF-16 format if the file has a BOM? – dw8547 Apr 29 '15 at 09:06
  • I tried both of these commands and then tried to `dos2unix` the crs_2013_data_clean.txt file and discovered another 9 binary symbols (0x03, 0x1c, 0x1d, 0x00,0x01,0x02,0x05,0x19 and 0x13). After I stripped them out using the command that you suggested, `dos2unix` finally worked. At this stage, should I be using the `-m` flag with `dos2unix` to add the BOM? – dw8547 Apr 29 '15 at 09:08
  • @user4842454 -- the BOM is OK -- I verified this by manually inspecting the file in vim. You don't need to run `dos2unix` on it any longer after the first `tr` command I gave you, by the way -- it's equivalent to `dos2unix` for a UTF-8, ISO-8859-X, or ASCII file. – LThode Apr 29 '15 at 12:54
  • I tried `:setlocal bomb?` in vim and got `bomb 1,1 Top`, is that it? OK (regarding not needing to run `dos2unix` after removing the carriage returns with the `tr` command). Do I need to strip out the remaining binary symbols (END OF TEXT (\003,0x03), INFORMATION SEPARATOR FOUR (\034, 0x1c))? I am asking because running `dos2unix` after running the second `tr` command alerted me to the presence of these additional binary symbols and if I do need to strip them out, how would I find out that they are present otherwise? – dw8547 Apr 29 '15 at 14:20
  • @user4842454 -- it depends entirely on whether the tools you are feeding them to are fazed by the occasional control character in the data. (And your results in VIM show that the BOM is just fine.) – LThode Apr 30 '15 at 12:42
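Picking up on the additional control characters mentioned in the comments above (0x03, 0x1c, 0x1d and so on), one sweep over the whole C0 range -- keeping only tab and newline -- could look like the command below. This assumes those bytes really are junk rather than meaningful field separators; note that it also removes the carriage returns, so the separate \r step becomes unnecessary:

tr -d '\000-\010\013-\037\177' < crs_2013_data_temp.txt > crs_2013_data_clean.txt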

I think this command is OK for your problem:

cat file | tr -d "\r" > new_file
Eric Aya
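As written, the command above assumes the file has already been converted out of UTF-16 (for example with the iconv step from the question); run on the raw UTF-16 export it would strip the 0x0d bytes but leave their 0x00 halves behind. The cat is also unnecessary -- a plain redirect does the same job:

tr -d '\r' < crs_2013_data_temp.txt > new_file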

That's how I solved it:

find . -type f -exec sed -i 's/\r//' {} \;
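For a single file rather than a whole directory tree, the same idea (again assuming the file is already UTF-8, since sed works on bytes) would be roughly:

sed -i 's/\r$//' crs_2013_data_temp.txt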