21

I've written a script that uses sed to clean up .csv files by removing bad commas and bad quotes ("bad" meaning they break an in-house program we use to transform these files):

# remove all commas, and re-insert the good commas using clean.sed
sed -f clean.sed $1 > $1.1st

# remove all quotes
sed 's/\"//g' $1.1st > $1.tmp

# add the good quotes around good commas
sed 's/\,/\"\,\"/g' $1.tmp > $1.tmp1

# add leading quotes
sed 's/^/\"/' $1.tmp1 > $1.tmp2

# add trailing quotes
sed 's/$/\"/' $1.tmp2 > $1.tmp3

# remove utf characters
sed 's/<feff>//' $1.tmp3 > $1.tmp4

# replace original file with new stripped version and delete .tmp files
cp -rf $1.tmp4 quotes_$1

Here is clean.sed:

s/\",\"/XXX/g;
:a
s/,//g
ta
s/XXX/\",\"/g;

Then it removes the temp files and voilà, we have a new file whose name starts with the word "quotes" that we can use for our other processes.

My question is:
Why do I need a sed statement to remove the feff tag from that temp file? The original file doesn't have it, but it always appears in the replacement. At first I thought cp was causing this, but if I put the sed statement that removes it before the cp, it isn't there.

Maybe I'm just missing something...

ROMANIA_engineer
SDGuero
  • 1
    Please post source for `clean.sed`. Which of the .tmpX files do feff first appear in? – wallyk Dec 29 '09 at 00:52
  • 2
    0xfeff is unicode byte order mark. Not sure what adds it in your case though. – Eugene Dec 29 '09 at 00:55
  • 2
    First question: Why do you create 4 temp files to do this instead of using in-place (sed -i) on $1.1st each time? Second: When does the byte order marker (feff) start appearing in your process? Is it there immediately after you run clean.sed? If so, you might want to post that script. Third [nitpick]: you don't need to escape double quotes when you're inside single quotes, and you never need to escape commas. 's/,/","/g' is a lot more readable than 's/\,/\"\,\"/g'. – glomad Dec 29 '09 at 00:58
  • ithcy, First question: Simple answer is that I didn't know any better. The sed documentation out there is pretty scattered and not too easy to follow. I pieced together this code to make something work. Thanks for the tips, I will work on implementing a cleaner version with your suggestions. The .sed file is a direct copy of someone else's code, which I may have gotten on this website... Second question: I just checked it out, it shows up after the first sed statement. Third question: see question #1 answer... ;) – SDGuero Dec 29 '09 at 01:15
  • To answer Wally's first question, I did a little experiment and can see the <feff> is appearing after the first sed statement. I have posted the code to clean.sed. Thanks! – SDGuero Dec 29 '09 at 01:27
  • I can tell you that's inefficient code. Show a sample of your csv files and what you don't want. Also show your final output. – ghostdog74 Dec 29 '09 at 01:43
  • Unless you really need all those intermediate stages with the `.tmpN` files, I would use a single sed call like `sed -e 's/\"//g' -e 's/\,/\"\,\"/g' -e ... $1.1st > $1.tmp4`. – ndim Dec 29 '09 at 01:47
  • I wouldn't worry too much about efficiency or brevity in this case. It looks like a one-off shell script that is not going to be run very often. If this is true, then breaking it out into multiple sed commands with relevant comments is a good idea. – glomad Dec 29 '09 at 02:13
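
The single-call suggestion in the comments above can be sketched as follows. This is only an illustration: `clean_csv` is a hypothetical function name, the steps mirror the ones posted in the question, and the `XXX` placeholder assumes the data never contains a literal `XXX`. The `:a`/`ta` loop from clean.sed is left out because `s/,//g` already removes every comma in one pass.

```shell
# clean_csv: read CSV text on stdin, write the cleaned text to stdout.
# (clean_csv is a hypothetical name; the steps follow the question's script.)
clean_csv() {
    sed -e 's/","/XXX/g' \
        -e 's/,//g' \
        -e 's/XXX/","/g' \
        -e 's/"//g' \
        -e 's/,/","/g' \
        -e 's/^/"/' \
        -e 's/$/"/'
    # In order: protect the good "," separators, drop the stray commas,
    # restore the separators, drop every quote, re-quote around the
    # remaining commas, then add the leading and trailing quote.
}
```

It could be called as `clean_csv < "$1" > "quotes_$1"`, which avoids the intermediate `.tmp` files. None of the quotes or commas need backslashes because the expressions are inside single quotes.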

3 Answers

19

U+FEFF is the code point for a byte order mark. Your files most likely contain data saved in UTF-16 and the BOM has been corrupted by your 'cleaning process' which is most likely expecting ASCII. It's probably not a good idea to remove the BOM, but instead to fix your scripts to not corrupt it in the first place.
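
One way to check whether the original file really starts with a BOM is to look at its first bytes rather than at what an editor displays. The sketch below is an assumption-laden illustration, not part of the answer: `detect_bom` is a hypothetical helper name, and it only distinguishes the UTF-8 and UTF-16 marks (a UTF-32LE file would be reported as UTF-16LE by this simple check).

```shell
# detect_bom FILE: print which byte order mark, if any, starts FILE.
# (detect_bom is a hypothetical helper name, not a standard tool.)
detect_bom() {
    # Dump the first three bytes as hex and strip the whitespace od adds.
    sig=$(head -c 3 "$1" | od -An -tx1 | tr -d ' \n')
    case "$sig" in
        efbbbf) echo "utf-8" ;;      # EF BB BF: U+FEFF encoded as UTF-8
        fffe*)  echo "utf-16le" ;;   # FF FE: U+FEFF, little-endian
        feff*)  echo "utf-16be" ;;   # FE FF: U+FEFF, big-endian
        *)      echo "none" ;;
    esac
}
```

Running `detect_bom "$1"` on the original file and on `$1.1st` would show whether the mark was present before the first sed call.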

Mark Byers
  • This is what I thought too, but he clearly states in the question that the BOM is not in the original file. – glomad Dec 29 '09 at 01:03
  • A BOM is invisible. My best guess given the information in the question is that the clean.sed script changes unprintable characters to their hex representation, and possibly also removes NUL characters. So the BOM maybe was there all along, it just becomes more visible after the "cleaning". – Mark Byers Dec 29 '09 at 01:07
  • here is clean.sed: s/\",\"/XXX/g; :a s/,//g ta s/XXX/\",\"/g; – SDGuero Dec 29 '09 at 01:12
  • I'm sure you're right, it's the only answer that makes sense. I'm just taking him at his word... (BOM is easily visible with cat utf16_file.txt) – glomad Dec 29 '09 at 01:16
  • Shouldn't vi display the BOM from the get-go? If it is there, vi cannot see it in the original file but can see it after the sed edits. I posted clean.sed... Please let me know if this is the root cause. Thanks! :) – SDGuero Dec 29 '09 at 01:22
  • 1
    No, vi knows how to handle Unicode and will not display the BOM. Do a :set fenc in vi and it will show you the encoding of the current file. Mark Byers is correct, you are probably seeing a mangled BOM after your sed because sed is outputting ASCII. – glomad Dec 29 '09 at 01:28
  • ...So to summarize, your csv file is UTF-16 encoded, and sed is probably not going to work for you unless you have the option to convert the file to ASCII first. (Try man iconv) If you can't do that, use something like a simple python script to do the text replacements. – glomad Dec 29 '09 at 01:37
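
The iconv suggestion above could look roughly like this. It is a sketch under assumptions: the input is taken to be UTF-16 with a BOM, UTF-8 is used as the target instead of ASCII so that non-ASCII data survives, and `sed_as_utf8` is a hypothetical helper name.

```shell
# sed_as_utf8 FILE SED_EXPR: decode a UTF-16 FILE, then apply SED_EXPR.
# (sed_as_utf8 is a hypothetical name; both encodings are assumptions.)
sed_as_utf8() {
    # With "-f UTF-16", common iconv implementations use the BOM to pick
    # the byte order and do not copy it to the output, so sed receives
    # plain UTF-8 text.
    iconv -f UTF-16 -t UTF-8 "$1" | sed -e "$2"
}
```

For example, `sed_as_utf8 input.csv 's/"//g' > output.csv` would strip the quotes from a UTF-16 file named `input.csv` (a placeholder name) and write UTF-8.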
4

To get rid of these in GNU emacs:

  1. Open Emacs
  2. Do a find-file-literally to open the file
  3. Edit off the leading three bytes
  4. Save the file
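
The same three-byte removal can be sketched in the shell for files where the mark is a UTF-8 BOM (EF BB BF). This is an illustration only; `strip_utf8_bom` is a hypothetical helper name, and it does nothing useful for UTF-16 files.

```shell
# strip_utf8_bom FILE: write FILE to stdout without a leading UTF-8 BOM.
# (strip_utf8_bom is a hypothetical helper name.)
strip_utf8_bom() {
    # Build the three BOM bytes with portable octal escapes.
    bom=$(printf '\357\273\277')
    # Only line 1 is touched, and only if it begins with the BOM.
    # LC_ALL=C makes sed compare raw bytes instead of decoded characters.
    LC_ALL=C sed "1s/^$bom//" "$1"
}
```

A file without the mark passes through unchanged, so the function is safe to run on every input.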

There is also a way to convert files with DOS line termination convention to Unix line termination convention.

Anirvan
stinkoid
0

It happened to me when I wanted to echo lines into a file I had previously cleared with: echo "" > somefile.txt

When I removed the file and ran the echo commands again, the "feff" no longer appeared when the first echo created the file.

Nik