Can't remove first two encode characters using text editors in linux

Question

When I open the file a.csv in a text editor, it shows:

aaa bbb ccc ddd eee fff ggg hhh iii jjj kkk

But when I cat it I have:

��aaa   bbb ccc ddd eee fff ggg hhh iii jjj kkk

So when I try to remove the first two characters (��), I can't. For example:

cat a.csv | sed 's/\(.\{2\}\)//'

The result is:

��aa    bbb ccc ddd eee fff ggg hhh iii jjj kkk

Take a look at http://stackoverflow.com/q/1068650/1679537 – xlecoustillier Dec 29 '14 at 11:30

cmaster - reinstate monica · Accepted Answer · 2014-12-29T13:22:52.003

3

This looks like a byte order mark that's prepended to your text.

If that is correct, you can fix this by converting your file to an encoding that doesn't use a byte order mark (for example plain UTF-8), and these two characters should be gone.

How you change the encoding of a file depends on the editor you use; in vim, the command is :set nobomb.
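
To check whether the file really starts with a byte order mark, it can help to look at its first bytes before changing anything. The following is a minimal sketch, not part of the original answer: `first_bytes` is a hypothetical helper name, and the demo file is made up for illustration. Common marks are ef bb bf (UTF-8), ff fe (UTF-16LE) and fe ff (UTF-16BE).

```shell
#!/bin/sh
# Hypothetical helper: print the first four bytes of a file as hex,
# separated by single spaces.
first_bytes() {
    head -c 4 "$1" | od -An -tx1 | tr -s ' ' | sed 's/^ //; s/ $//'
}

# Demo file (an assumption for illustration): a UTF-16LE byte order mark
# (ff fe) followed by the letter "a" encoded as UTF-16LE (61 00).
demo=$(mktemp)
printf '\377\376a\000' > "$demo"
first_bytes "$demo"    # prints: ff fe 61 00
rm -f "$demo"
```

Running `first_bytes a.csv` on the real file shows which mark, if any, is present, and therefore which encoding the file would have to be converted from.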

edited Dec 29 '14 at 13:22

answered Dec 29 '14 at 11:47

cmaster - reinstate monica


Thank you. How is that possible? – MLSC Dec 29 '14 at 11:52
The BOM is three bytes in UTF-8, though; but perhaps the terminal is not displaying every byte as an unknown character. +1 -- this would be my first speculation, too. – tripleee Dec 29 '14 at 11:53
I didn't get it... pardon – MLSC Dec 29 '14 at 11:57
@MortezaLSC How you change the encoding depends on your editor. Some have the option in the `Save as...` dialog, some have a special menu entry for this, some allow you to change the encoding via a pop-up or context menu. Quite a few editors don't even have the option. – cmaster - reinstate monica Dec 29 '14 at 11:59
I am using `vim` to view the content of a.csv. There are lots of files, so I want to use bash to deal with this. – MLSC Dec 29 '14 at 12:00
@tripleee There are byte order marks with two, three, four, and even five bytes, see this link for a complete list: https://en.wikipedia.org/wiki/Byte_order_mark#Representations_of_byte_order_marks_by_encoding – cmaster - reinstate monica Dec 29 '14 at 12:02
In `vim` you can just do `:set nobomb` before writing the file to remove a byte order mark (no idea why they called the option `bomb`...). – cmaster - reinstate monica Dec 29 '14 at 12:08
It is of course possible that there are invisible zero bytes but the available information suggests an 8-bit encoding on the OP's terminal and in the file. – tripleee Dec 29 '14 at 12:18
@cmaster Thank you, +1, good solution. If I want to do it for more than 1000 files, is there a sensible (automated) way? – MLSC Dec 29 '14 at 12:22
Yes, you have two options for this: 1. use vim's `argdo` command, 2. use the shell to iterate over the files and call `vim $file '+set nobomb' +w +q` in the loop. The first will likely be faster, and it won't flash your terminal screen as the second one will. Nevertheless, the `+` trick is quite useful to know. – cmaster - reinstate monica Dec 29 '14 at 13:14
This does not provide an answer to the question. To critique or request clarification from an author, leave a comment below their post. – Ilija Dimov Dec 29 '14 at 13:18
@IlijaDimov Rephrased my answer a bit, hope you like it better now. – cmaster - reinstate monica Dec 29 '14 at 13:23
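
For the batch case raised in the comments, the shell-loop option can also be written without starting vim for each file. This is a minimal sketch under stated assumptions, not the method from the answer: it assumes the files are UTF-8 with the three-byte mark ef bb bf, and the helper names `strip_utf8_bom` and `strip_bom_in_dir` are hypothetical. A UTF-16 file would need a real encoding conversion instead of byte stripping.

```shell
#!/bin/sh
# Hypothetical helper: strip a leading UTF-8 byte order mark (ef bb bf)
# from one file, in place. Files without that mark are left untouched.
strip_utf8_bom() {
    file=$1
    # Read the first three bytes as a hex string such as "efbbbf".
    head3=$(head -c 3 "$file" | od -An -tx1 | tr -d ' \n')
    if [ "$head3" = "efbbbf" ]; then
        tmp=$(mktemp) || return 1
        # tail -c +4 prints everything from the fourth byte onward.
        tail -c +4 "$file" > "$tmp" && cat "$tmp" > "$file"
        rm -f "$tmp"
    fi
}

# Hypothetical helper: apply strip_utf8_bom to every .csv file in a directory.
strip_bom_in_dir() {
    for f in "$1"/*.csv; do
        [ -f "$f" ] || continue   # skip the unexpanded pattern if nothing matches
        strip_utf8_bom "$f"
    done
}
```

A call such as `strip_bom_in_dir /path/to/csv/files` (the path is a placeholder) would process a whole directory. Writing back with `cat "$tmp" > "$file"` rather than `mv` keeps the original file's permissions.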

score 0 · Answer 2 · answered Dec 29 '14 at 12:07

This might work for you (GNU sed):

sed -r 's/(\o357\o277\o275){2}//g' file

This removes any double occurrence of the octal triple \357\277\275 (the UTF-8 encoding of U+FFFD, the replacement character).

N.B. To find the octal values, run sed -n l file and look for escapes of the form \nnn.
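
One possible reason the `sed 's/\(.\{2\}\)//'` attempt in the question skipped the two leading bytes: in a UTF-8 locale, `.` matches only valid characters, and bytes such as ff fe are not valid UTF-8. The following is a sketch assuming that is the cause and that the first two bytes are the unwanted ones; `drop_two_bytes` is a hypothetical helper name and the demo content is made up. It only removes bytes and does not convert the file's encoding.

```shell
#!/bin/sh
# Hypothetical helper: drop the first two bytes of the first line, byte-wise.
# LC_ALL=C makes sed treat every byte as one character, so "." also matches
# bytes that are not valid UTF-8.
drop_two_bytes() {
    LC_ALL=C sed '1s/^..//' "$1"
}

# Demo (an assumption for illustration): two stray bytes (ff fe) in front
# of plain ASCII text.
demo=$(mktemp)
printf '\377\376aaa\tbbb\n' > "$demo"
drop_two_bytes "$demo"    # prints the line without the two leading bytes
rm -f "$demo"
```

Redirecting the output to a new file, for example `drop_two_bytes a.csv > a.clean.csv` (the output name is a placeholder), keeps the original intact while the result is checked.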

answered Dec 29 '14 at 12:07

potong


It doesn't work for me .... :( – MLSC Dec 29 '14 at 12:12
Why do you expect and require two occurrences of this sequence? – tripleee Dec 29 '14 at 14:50
