I'm trying to transcode a bunch of files from ASCII to UTF-8.

For that, I tried using iconv:

iconv -f US-ASCII -t UTF-8 infile > outfile

-f ENCODING the encoding of the input

-t ENCODING the encoding of the output

The file still wasn't converted to UTF-8. It is a .dat file.

Before posting this, I searched Google and found information like:

ASCII is a subset of UTF-8, so all ASCII files are already UTF-8 encoded. The bytes in the ASCII file and the bytes that would result from "encoding it to UTF-8" would be exactly the same bytes. There's no difference between them.

Force encode from US-ASCII to UTF-8 (iconv)

Best way to convert text files between character sets?

The links above still didn't help.

Even though an ASCII file is already valid UTF-8, since UTF-8 is a superset of ASCII, the other party who is going to receive the files from me needs the file encoding to be UTF-8. He just needs the file format to be UTF-8.

Any suggestions, please?

Angela
Ram
  • It's not at all clear what the problem is - just give the person the original ASCII files. If they're genuine ASCII, they're already UTF-8, so they should be fine. What's actually going wrong? – Jon Skeet Feb 07 '15 at 09:02
  • @Jon Skeet The other party expects the file format to be UTF-8. When I run the command file -i outfile, it returns ascii, but they strictly want utf-8, even though ASCII is a subset of UTF-8. – Ram Feb 07 '15 at 09:12
  • It *is* "UTF-8 strictly" if it's genuinely ASCII. Based on your comment, it sounds like the other party is basically broken, if they're rejecting ASCII files because of the output of `file`. They should accept that ASCII files are UTF-8 files, and just continue to process it anyway. – Jon Skeet Feb 07 '15 at 10:42
  • @JonSkeet In the absence of additional details, I would be inclined to agree. It would probably be worth suggesting and encouraging them to accept both responses from 'file'. Their API will be more flexible and robust, and they'll save themselves from having to have this exact discussion over and over with others using it. If they are unable or unwilling to do so, then at least a very explicit statement in their documentation that the BOM is required in the input file, using that precise language, would probably also go a long way. – Timothy Johns Feb 07 '15 at 19:23

1 Answer

I'm a little confused by the question, because, as you indicated, ASCII is a subset of UTF-8, so all ASCII files are already UTF-8 encoded.
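
A quick way to check this, as a minimal sketch assuming a POSIX shell with iconv and cmp available (the file names ascii_in.dat and ascii_out.dat are made up for the example), is to convert a pure-ASCII file and compare the result byte for byte:

```shell
# Create a small file that contains only 7-bit ASCII characters.
printf 'Stuff\n' > ascii_in.dat

# "Convert" it from US-ASCII to UTF-8, as in the question.
iconv -f US-ASCII -t UTF-8 ascii_in.dat > ascii_out.dat

# cmp -s is silent and exits 0 only when both files are byte-for-byte equal.
if cmp -s ascii_in.dat ascii_out.dat; then
  echo "identical: the conversion changed nothing"
else
  echo "different: the input was probably not pure ASCII"
fi
```

For a genuinely ASCII input the two files should come out identical, which is why a tool that inspects the bytes still reports the output as ASCII.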

If you're sending files containing only ASCII characters to the other party, but the other party is complaining that they're not 'UTF-8 Encoded', then I would guess that they're referring to the fact that the ASCII file has no byte order mark explicitly indicating the contents are UTF-8.

If that is indeed the case, then you can add a byte order mark using the answer here:

iconv: Converting from Windows ANSI to UTF-8 with BOM

If the other party indicates that he does not need the 'BOM' (Byte Order Mark), but is still complaining that the files are not UTF-8, then another possibility is that your initial file is not actually ASCII, but rather contains characters encoded in a legacy 8-bit encoding such as Windows-1252 (commonly called 'ANSI' on Windows) or ISO-8859-1.
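
One way to test that possibility is to count the bytes that fall outside the 7-bit ASCII range. The following is only a sketch: the helper name count_non_ascii and the sample file names are hypothetical, and the choice of ISO-8859-1 as the source encoding is an assumption you would need to confirm for your own data.

```shell
# Hypothetical helper: print how many bytes in a file are >= 0x80.
# tr deletes every byte in the ASCII range (octal 000-177); wc counts the rest.
count_non_ascii() {
  LC_ALL=C tr -d '\000-\177' < "$1" | wc -c | tr -d ' '
}

# Sample data: one pure-ASCII file, one with an ISO-8859-1 e-acute (octal 351).
printf 'plain text\n' > sample_ascii.dat
printf 'caf\351\n' > sample_latin1.dat

count_non_ascii sample_ascii.dat
count_non_ascii sample_latin1.dat

# If the count is non-zero, convert from the real source encoding
# (assumed here to be ISO-8859-1) rather than from US-ASCII.
iconv -f ISO-8859-1 -t UTF-8 sample_latin1.dat > sample_utf8.dat
count_non_ascii sample_utf8.dat
```

A count of zero means the file is pure ASCII and iconv has nothing to change. A non-zero count means the -f argument in the original command was wrong for that file.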

Edited to add the following experiment, after a comment from Ram about the other party checking the type with the 'file' command:

Tims-MacBook-Pro:~ tjohns$ echo 'Stuff' > deleteme
Tims-MacBook-Pro:~ tjohns$ cat deleteme
Stuff
Tims-MacBook-Pro:~ tjohns$ file -I deleteme
deleteme: text/plain; charset=us-ascii
Tims-MacBook-Pro:~ tjohns$ echo -ne '\xEF\xBB\xBF' > deleteme
Tims-MacBook-Pro:~ tjohns$ echo 'Stuff' >> deleteme
Tims-MacBook-Pro:~ tjohns$ cat deleteme
Stuff
Tims-MacBook-Pro:~ tjohns$ file -I deleteme
deleteme: text/plain; charset=utf-8
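
The same experiment can be written a little more portably with printf, since echo -ne is not handled the same way by every shell. This is a rough sketch rather than a definitive tool: the function name add_bom and the file names are hypothetical, and it assumes head -c and od are available.

```shell
# Hypothetical helper: copy $1 to $2, prepending the UTF-8 BOM (EF BB BF)
# unless the input already starts with one.
add_bom() {
  bom_src=$1
  bom_dst=$2
  # Read the first three bytes as lowercase hex with all whitespace removed.
  bom_head=$(head -c 3 "$bom_src" | od -An -tx1 | tr -d ' \n')
  if [ "$bom_head" = "efbbbf" ]; then
    cat "$bom_src" > "$bom_dst"
  else
    # Octal 357 273 277 is the byte sequence EF BB BF.
    { printf '\357\273\277'; cat "$bom_src"; } > "$bom_dst"
  fi
}

printf 'Stuff\n' > bom_in.dat
add_bom bom_in.dat bom_out.dat

# Show the first three bytes of the result in hex.
head -c 3 bom_out.dat | od -An -tx1
```

Whether file then reports charset=utf-8 depends on the version of file installed, so it is worth running file -i (or file -I on macOS) on the result in the target environment.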
Timothy Johns
  • Hi @Timothy Johns. Thanks for your explanation. The other party checks the file format using file -i outfile, which returns ascii; they want it to be utf-8 so they can process it further. – Ram Feb 07 '15 at 09:15
  • @Ram In that case I'm about 98% certain the other party is looking for the byte order mark. On Mac OS 'file' will output "text/plain; charset=utf-8" if it's there, and "text/plain; charset=us-ascii" if it's not. I'll edit the answer above to add an experiment. – Timothy Johns Feb 07 '15 at 18:41
  • Hi @Timothy Johns, I am working in a Linux environment. The reason they are asking for UTF-8 is that they want to support a few more characters that are not available in ASCII. Please note that all of this is for processing the data in Hadoop (data world). – Ram Feb 09 '15 at 08:02
  • Hi @Timothy Johns. Thanks for all your input. I tried the following command and it converted the ASCII file to UTF-8 format: (printf "\357\273\277"; cat inputfile) > outputfile. When I give it an ASCII inputfile, it gives me a UTF-8 outputfile. – Ram Feb 10 '15 at 07:49