5

I've got a file with UTF-8 (without BOM) encoding. The file is created on the Windows side and transferred to a Linux server over SFTP. Running cat -e on it, I get something like this:

cat -e file.txt

M-oM-;M-?test13;hbana0Kw;$
lala;LjgX$

Now, I know that M-oM-;M-? stands for UTF-8 (without BOM). Is there a way to remove it from the file but preserve its encoding?

NRG
  • 1
    If it's not a BOM, it is actual character data which you cannot remove without altering the actual contents of the file. However, it looks to me like a BOM. What's the output of `cut -b1-3 file | od -ch`? – tripleee Nov 24 '14 at 12:25
  • 1
    Hi, it's `0000000 357 273 277 \n l a l \n bbef 0abf 616c 0a6c 0000010` – NRG Nov 24 '14 at 12:30
  • So it's a BOM: the bytes `EF BB BF` are the UTF-8 encoding of U+FEFF, aka [ZERO WIDTH NO-BREAK SPACE](http://www.fileformat.info/info/unicode/char/FEFF/index.htm). (The `od -h` output shows little-endian 16-bit words, so the bytes look swapped, further confusing issues.) – tripleee Nov 24 '14 at 12:37
  • Great, so I was mistaken about what it is. Now, is there a way to remove it on the Linux side, or should I try to remove it on the Windows side? – NRG Nov 24 '14 at 12:42
  • Trivially `LC_ALL=C cut -b4- file >newfile` – tripleee Nov 24 '14 at 12:44
  • Yes, but it also removes the first 3 characters from every single line in the file, so for the second line I get a;LjgX instead of lala;LjgX – NRG Nov 24 '14 at 13:12
  • 1
    `sed -e '1s/^.//' file.txt` – Etan Reisner Nov 24 '14 at 13:27
  • Alternative to sed: `tail --bytes=+4 < oldfile > newfile` (output starts at byte 4, so the first 3 bytes are skipped) – ua2b Nov 24 '14 at 13:30
  • @EtanReisner Still not right. It doesn't matter if I use sed on file or cut - it takes away first 3 characters from each line. – NRG Nov 24 '14 at 13:36
  • Then you missed the leading `1` on my sed command. `1s/^.//` not `s/^.//`. – Etan Reisner Nov 24 '14 at 13:41
  • @EtanReisner even worse ... I've run your `sed` command and then did `cat` on old file. For now `sed` works fine. – NRG Nov 24 '14 at 14:42
  • Now I'm confused. Did `sed` work or not? That `sed` command should only modify the first line and should print out every other line as-is. What version of `sed` are you using? – Etan Reisner Nov 24 '14 at 15:28
  • @EtanReisner Sorry, my answer was a little misleading. I meant that your answer was right: `sed -e '1s/^.//'` works fine. I was checking the wrong file for results. Can you explain what exactly this command does? – NRG Nov 24 '14 at 15:55
  • possible duplicate of [<U+FEFF> character showing up in files. How to remove them?](http://stackoverflow.com/questions/7297888/ufeff-character-showing-up-in-files-how-to-remove-them) – tripleee Nov 24 '14 at 16:35
  • Similar to https://unix.stackexchange.com/q/381230/22653 – kbulgrien Aug 23 '23 at 23:16

2 Answers

4

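To remove the BOM from the first line of a file, you can use something like `sed -e '1 s/^.//' file.txt`. This writes the result to standard output rather than changing the file, and it relies on a UTF-8 locale, where `.` matches the whole BOM as a single character.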

sed commands have two parts: an address and a command. Most of the time you see sed used without addresses (which means the command applies to all lines), but you can restrict the command to specific lines by using addresses.

In this case the address is 1 meaning the first line. So the replacement only applies to the first line and every line is printed (as that is the default sed behaviour).
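The effect of the address can be seen in a small sketch. The sample file below is made up for illustration (its path comes from `mktemp`), and the pattern is written as `...` under `LC_ALL=C` so that it strips exactly three bytes regardless of the user's locale; that is a variation on the command above, not the same command.

```shell
# Build a sample file: UTF-8 BOM (EF BB BF, octal 357 273 277) plus two lines.
sample=$(mktemp)
printf '\357\273\277test13;hbana0Kw;\nlala;LjgX\n' > "$sample"

# No address: the substitution runs on every line, so line 2 is damaged too.
every_line=$(LC_ALL=C sed -e 's/^...//' "$sample")

# Address 1: the substitution runs on the first line only;
# all other lines are printed unchanged.
first_line_only=$(LC_ALL=C sed -e '1s/^...//' "$sample")

printf 'without address:\n%s\n' "$every_line"
printf 'with address 1:\n%s\n' "$first_line_only"

rm -f "$sample"
```

Without the address, the second line comes out as `a;LjgX`, which is the symptom reported in the comments; with the address it stays `lala;LjgX`.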

Etan Reisner
3

When transferring a file from Windows to Linux, run the dos2unix command on it. This removes the BOM and converts the line endings to Unix style.

dos2unix file.txt
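
To confirm whether a file still starts with a BOM, before or after conversion, one option is to look at its first three bytes. This is a minimal sketch: `has_bom` is a hypothetical helper (not part of dos2unix), the sample files are made up, and it assumes a `head` that supports `-c`.

```shell
# has_bom FILE: succeeds if FILE starts with the UTF-8 BOM bytes EF BB BF.
has_bom() {
    # od prints the bytes as hex (e.g. " ef bb bf"); tr strips the spacing.
    [ "$(head -c 3 "$1" | od -An -tx1 | tr -d ' \n')" = "efbbbf" ]
}

# Two throwaway files: one with a BOM, one without.
with_bom=$(mktemp)
without_bom=$(mktemp)
printf '\357\273\277lala;LjgX\n' > "$with_bom"
printf 'lala;LjgX\n' > "$without_bom"

if has_bom "$with_bom"; then result_with=yes; else result_with=no; fi
if has_bom "$without_bom"; then result_without=yes; else result_without=no; fi
echo "with BOM: $result_with, without BOM: $result_without"

rm -f "$with_bom" "$without_bom"
```

Running `has_bom file.txt` after the conversion should then fail (non-zero exit status) if the BOM was removed.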
LoMaPh