
When I open the file a.csv in a text editor, it shows:

aaa bbb ccc ddd eee fff ggg hhh iii jjj kkk

But when I cat it, I get:

��aaa   bbb ccc ddd eee fff ggg hhh iii jjj kkk

I want to remove the first two characters (��), but I can't. For example:

cat a.csv | sed 's/\(.\{2\}\)//'

The result is:

��aa    bbb ccc ddd eee fff ggg hhh iii jjj kkk
MLSC

2 Answers


This looks like a byte order mark that's prepended to your text.

If that is correct, you can fix this by converting your file to an encoding that doesn't use a byte order mark (for example plain UTF-8), and these two characters should be gone.

How you change the encoding of a file depends on the editor you use. In vim, run :set nobomb before writing the file.
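
Before converting anything, it can help to confirm which byte order mark, if any, the file starts with. The following is a minimal sketch rather than part of the original answer: `detect_bom` is a hypothetical helper name, it assumes `head -c` and `od` are available, and it only distinguishes the UTF-8 and UTF-16 marks (a UTF-32LE mark would be reported as UTF-16LE).

```shell
# Hypothetical helper: print which byte order mark (if any) a file starts with.
detect_bom() {
    # First three bytes as lowercase hex without separators, e.g. "efbbbf".
    sig=$(head -c 3 "$1" | od -An -tx1 | tr -d ' \n')
    case "$sig" in
        efbbbf) echo "UTF-8" ;;      # EF BB BF
        fffe*)  echo "UTF-16LE" ;;   # FF FE
        feff*)  echo "UTF-16BE" ;;   # FE FF
        *)      echo "none" ;;
    esac
}

# Usage: detect_bom a.csv
```

If this reports a UTF-16 mark, the file is not plain UTF-8 at all and would need to be re-encoded (for example with iconv) rather than having a few bytes cut off.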

cmaster - reinstate monica
  • Thank you. How is that possible? – MLSC Dec 29 '14 at 11:52
  • The BOM is three bytes in UTF-8, though; but perhaps the terminal is not displaying every byte as an unknown character. +1 -- this would be my first speculation, too. – tripleee Dec 29 '14 at 11:53
  • I didn't get it... pardon – MLSC Dec 29 '14 at 11:57
  • @MortezaLSC How you change the encoding depends on your editor. Some have the option in the `Save as...` dialog, some have a special menu entry for this, some allow you to change the encoding via a pop-up or context menu. Quite a few editors don't even have the option. – cmaster - reinstate monica Dec 29 '14 at 11:59
  • I am using `vim` to view the content of a.csv. There are lots of files like this, and I want to use bash to deal with them. – MLSC Dec 29 '14 at 12:00
  • @tripleee There are byte order marks with two, three, four, and even five bytes, see this link for a complete list: https://en.wikipedia.org/wiki/Byte_order_mark#Representations_of_byte_order_marks_by_encoding – cmaster - reinstate monica Dec 29 '14 at 12:02
  • In `vim` you can just do `:set nobomb` before writing the file to remove a byte order mark (no idea why they called the option `bomb`...). – cmaster - reinstate monica Dec 29 '14 at 12:08
  • It is of course possible that there are invisible zero bytes but the available information suggests an 8-bit encoding on the OP's terminal and in the file. – tripleee Dec 29 '14 at 12:18
  • @cmaster Thank you, +1, good solution. What if I want to do this for more than 1000 files? Is there a sensible (automated) way? – MLSC Dec 29 '14 at 12:22
  • Yes, you have two options for this: 1. use vim's `argdo` command, 2. use the shell to iterate over the files and call `vim $file '+set nobomb' +w +q` in the loop. The first will likely be faster, and it won't flash your terminal screen as the second one will. Nevertheless, the `+` trick is quite useful to know. – cmaster - reinstate monica Dec 29 '14 at 13:14
  • This does not provide an answer to the question. To critique or request clarification from an author, leave a comment below their post. – Ilija Dimov Dec 29 '14 at 13:18
  • @IlijaDimov Rephrased my answer a bit, hope you like it better now. – cmaster - reinstate monica Dec 29 '14 at 13:23
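
For a large batch of files, the same cleanup can also be done from the shell without opening vim. The following is a minimal sketch that swaps in sed for the vim commands from the comments above; it is not the commenter's method. It assumes the files are UTF-8 and start with the three-byte mark EF BB BF, and `strip_utf8_bom` is a hypothetical helper name.

```shell
# Hypothetical helper: strip a leading UTF-8 BOM (EF BB BF) from each file given.
strip_utf8_bom() {
    # Build the three BOM bytes with printf octal escapes.
    bom=$(printf '\357\273\277')
    for f in "$@"; do
        tmp=$(mktemp) || return 1
        # LC_ALL=C makes sed treat the input as raw bytes;
        # only the start of line 1 is touched.
        LC_ALL=C sed "1s/^$bom//" "$f" > "$tmp" || { rm -f "$tmp"; return 1; }
        # Write back in place so the file keeps its permissions.
        cat "$tmp" > "$f"
        rm -f "$tmp"
    done
}

# Usage (assumes the CSV files sit in the current directory):
# strip_utf8_bom ./*.csv
```

Files without a mark pass through unchanged, so the loop is safe to run over a mixed set.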

This might work for you (GNU sed):

sed -r 's/(\o357\o277\o275){2}//g' file

This removes any doubled occurrence of the octal triple \357\277\275 (the UTF-8 encoding of U+FFFD, the replacement character).

N.B. To find the octal values, run sed -n l file and look for values beginning with \nnn.
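
The inspection step can be sketched as a small wrapper that prints only the first line, where the stray bytes sit. This is an illustrative sketch: `show_first_line_bytes` is a hypothetical name, and `LC_ALL=C` is added here as an assumption so that sed treats the input as raw bytes regardless of the terminal's locale.

```shell
# Hypothetical helper: print the first line of a file with non-printable
# bytes written as \nnn octal escapes, using sed's `l` command.
show_first_line_bytes() {
    # -n suppresses normal output; '1l' lists line 1 unambiguously
    # and marks the end of the line with '$'.
    LC_ALL=C sed -n '1l' "$1"
}

# Usage: show_first_line_bytes a.csv
```

Any \nnn sequences at the start of the output are the bytes to put into the substitution pattern.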

potong