1

I want to cat multiple UTF-8 text files together without ending up with multiple BOMs in the middle of the combined file. Is there a proper way to do this besides stripping the BOM from each file?

My issue is that, after stripping the BOMs and catting the files together, I'm having trouble copying the data into a Postgres table. Postgres complains that the data is not UTF-8. I can copy one of the small original files, BOM included, just fine. Only the combined file with all the BOMs stripped causes issues.

Thanks.

user1272324
  • 3
    Don't strip the BOM on the first file then? – Mat Feb 14 '13 at 18:39
  • 2
    See http://stackoverflow.com/a/4365180/469210 for information on adding a BOM back to a final concatenated file. However @Mat's suggestion of just leaving the BOM in the first file avoids that step. – borrible Feb 14 '13 at 18:43
  • side question - what is `BOM` ? Please let me know. – mtk Feb 14 '13 at 19:26
  • @mtk BOM stands for Byte-Order-Mark. – borrible Feb 14 '13 at 19:27
  • 2
    Let me get this right: Postgres rejects the UTF-8 data **unless there is a BOM**? Sounds like a bug in Postgres! BOMs are evil and have no place in UTF-8. They just cause trouble, such as what you are experiencing when concatenating strings [files] together. Postgres definitely should not be requiring one. – Celada Feb 15 '13 at 15:11

1 Answer

2

There is no byte order ambiguity in UTF-8, so the BOM is not necessary. No program that processes UTF-8 should require one. If a BOM does occur at the start of a UTF-8 stream, it is always the three bytes EF BB BF. The correct way to remove the BOM from UTF-8 is to first check that the stream starts with these three bytes, and only then delete them. If you delete three bytes from a UTF-8 stream that does not start with these three bytes, you are not deleting a BOM, and you could be corrupting the UTF-8.
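
The check-then-delete rule above could be sketched roughly as follows in Python. This is only an illustration, not something from the original question: the helper names `strip_utf8_bom` and `concat_without_boms` are made up for this example, and it assumes every input file really is UTF-8.

```python
import codecs
import shutil


def strip_utf8_bom(data: bytes) -> bytes:
    """Return data without a leading UTF-8 BOM; leave it untouched otherwise."""
    # codecs.BOM_UTF8 is the three bytes EF BB BF.
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):]
    return data


def concat_without_boms(paths, out_path):
    """Concatenate files in binary mode, dropping a BOM only where one is present."""
    with open(out_path, "wb") as out:
        for path in paths:
            with open(path, "rb") as src:
                # Inspect only the first three bytes, then stream the rest unchanged.
                head = src.read(len(codecs.BOM_UTF8))
                out.write(strip_utf8_bom(head))
                shutil.copyfileobj(src, out)
```

Because the first three bytes are removed only when they are exactly EF BB BF, a file that never had a BOM passes through byte for byte, which avoids the corruption described above.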

Kaz