Deleting first n rows and column x from multiple files using Bash script

Question

I am aware that the "deleting n rows" and "deleting column x" questions have both been answered individually before. My current problem is that I'm writing my first bash script, and am having trouble making that script work the way I want it to.

file0001.csv (there are several hundred files like these in one folder)

Data number of lines 540
No.,Profile,Unit
1,1027.84,µm
2,1027.92,µm
3,1028,µm
4,1028.81,µm

Desired output (first two lines and the third column removed)

1,1027.84
2,1027.92
3,1028
4,1028.81

I am able to use sed and cut individually but for some reason the following bash script doesn't take cut into account. It also gives me an error "sed: can't read ls: No such file or directory", yet sed is successful and the output is saved to the original files.

sem2csv.sh

for files in 'ls *.csv'  #list of all .csv files
do
  sed '1,2d' -i $files | cut -f  '1-2' -d  ','
done

Actual output:

1,1027.84,µm
2,1027.92,µm
3,1028,µm
4,1028.81,µm

I know there may be awk one-liners but I would really like to understand why this particular bash script isn't running as intended. What am I missing?

score 2 · Accepted Answer · edited May 23 '17 at 11:52

The -i option of sed modifies the file in place. Your pipeline to cut receives no input because sed -i produces no output. Without this option, sed would write the results to standard output, instead of back to the file, and then your pipeline would work; but then you would have to take care of writing the results back to the original file yourself.
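
A minimal sketch of that difference, assuming GNU sed (where -i takes no argument); the scratch directory and the file name demo.csv are made up for the demonstration:

```shell
# Scratch directory and demo.csv are illustrative, not from the question.
dir=$(mktemp -d)
printf '%s\n' 'Data number of lines 540' 'No.,Profile,Unit' \
  '1,1027.84,µm' '2,1027.92,µm' > "$dir/demo.csv"

# Without -i, sed writes to standard output, so cut has something to read.
piped=$(sed '1,2d' "$dir/demo.csv" | cut -d ',' -f 1-2)
printf 'without -i, cut saw:\n%s\n' "$piped"

# With -i (GNU sed syntax), the file is rewritten and nothing reaches the pipe.
silent=$(sed -i '1,2d' "$dir/demo.csv" | cut -d ',' -f 1-2)
printf 'with -i, cut saw: [%s]\n' "$silent"

# The third column is still in the file because cut never received any input.
cat "$dir/demo.csv"
```

The second pipeline prints empty brackets, and the final cat shows the file with its first two lines gone but all three columns intact, which matches the "Actual output" in the question.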

Moreover, single quotes inhibit expansion -- you are "looping" over the single literal string ls *.csv, so the loop body runs exactly once. Because $files is not quoted inside the loop, that string is then subject to word splitting and wildcard expansion there. So after variable interpolation, your sed command expands to

sed -i 1,2d ls *.csv

and then the shell expands *.csv because it is not quoted. sed treats ls as just another file name, which is where the "can't read ls: No such file or directory" message comes from, while the real .csv files are still edited in place. You probably attempted to copy an example which used backticks (ASCII 96) instead of single quotes (ASCII 39) -- the difference is quite significant.

Anyway, the ls is useless -- the proper idiom is

for files in *.csv; do
  sed '1,2d' "$files" ...   # the double quotes here are important
done
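
The effect of the different quoting styles can be checked in a scratch directory. This is only a sketch; the file names, including the one containing a space, are invented for the demonstration:

```shell
dir=$(mktemp -d)
cd "$dir" || exit 1
touch file0001.csv file0002.csv 'my file.csv'   # hypothetical file names

# Single quotes: the loop body runs once, with the literal string "ls *.csv".
quoted=0
for f in 'ls *.csv'; do
  quoted=$((quoted + 1))
  printf 'single-quoted item: %s\n' "$f"
done

# Bare glob: one iteration per matching file, spaces in names preserved.
globbed=0
for f in *.csv; do
  globbed=$((globbed + 1))
  printf 'glob item: %s\n' "$f"
done

# Unquoted $f is split on whitespace, so "my file.csv" becomes two words.
f='my file.csv'
set -- $f            # deliberately unquoted to show the splitting
unquoted_words=$#
set -- "$f"
quoted_words=$#
printf 'unquoted: %s words, quoted: %s word\n' "$unquoted_words" "$quoted_words"
```

The last part is why "$files" is double-quoted in the idiom above: without the quotes, a name with a space would reach sed as two separate arguments.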

Mixing sed and cut is usually not a good idea because you can express anything cut can do in terms of a simple sed script. So your entire script could be

for f in *.csv; do
    sed -i -e '1,2d' -e 's/,[^,]*$//' "$f"
done

which says to delete the first two lines and, on each remaining line, to remove the last comma and everything after it. (If your sed does not like multiple -e options, try with a semicolon separator: sed -i '1,2d;s/,[^,]*$//' "$f")
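
As a quick check of what the script does, the sketch below runs the same sed invocation on a throwaway copy of the sample data. The scratch directory is an assumption for the demonstration, and GNU sed is assumed for -i without an argument:

```shell
dir=$(mktemp -d)
printf '%s\n' 'Data number of lines 540' 'No.,Profile,Unit' \
  '1,1027.84,µm' '2,1027.92,µm' '3,1028,µm' > "$dir/file0001.csv"

for f in "$dir"/*.csv; do
  # 1,2d         deletes the first two lines
  # s/,[^,]*$//  deletes the last comma and everything after it on each remaining line
  sed -i -e '1,2d' -e 's/,[^,]*$//' "$f"
done

result=$(cat "$dir/file0001.csv")
printf '%s\n' "$result"
```

Note that the substitution removes the last field of each line; that is the third column here only because the sample data has exactly three columns.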

edited May 23 '17 at 11:52

Community


answered Dec 15 '15 at 06:42

tripleee


Thank you for this extremely useful answer, this is exactly what I wanted to know. :D – biohazard Dec 15 '15 at 07:11
@biohazard No, not really. If there is a single script string, the `-e` is superfluous. Though on *BSD you will need to supply an option to `-i` -- but as your example wasn't using one, I assume you are on a different platform where this is not required. – tripleee Dec 15 '15 at 07:19
Hmm, it's still not deleting the third column... tried with both multiple -e and without. The character set is Shift-JIS, but that shouldn't matter since the columns are there... I am using Linux Mint with bash. – biohazard Dec 15 '15 at 07:21
Try with an explicit loop after all. I updated the answer. (Though I don't really see how it would cause the symptom you are describing.) – tripleee Dec 15 '15 at 07:24
This is very weird, the µm's still show up. Opening the output csv with vim shows 1,1027.84,<83>Êm^M and the terminal tells me this file might be a binary when I try opening it with the less command. – biohazard Dec 15 '15 at 07:28
Looks like proper shift-JIS displayed in some legacy 8-bit character set (CP1252?) to me. The `^M` is a DOS carriage return so that's probably the reason for any erratic behavior, though again, I don't see how it would exhibit the symptoms you describe. But this is a FAQ; `dos2unix` to the rescue. – tripleee Dec 15 '15 at 07:29
Yup, `dos2unix` took care of the `^M` but the 3rd column is still sticking around. I will continue trying and either open up a new question or give up and take care of it in R. Thanks! – biohazard Dec 15 '15 at 07:39
Some legacy `sed` variants might have issues with high-bit characters. Try `perl -pi -e 's/,[^,]*$//' file` on a single file just to see if that helps. Maybe play around with your locale settings, too. – tripleee Dec 15 '15 at 07:42
Problem solved! Inside the `for` loop, I successively used `dos2unix "$f"` to get rid of the DOS carriage returns, `iconv -f shift-jis -t utf-8 "$f" -o "$f"` to convert Shift-JIS to UTF-8 (the terminal finally stopped warning me about this being a binary file and properly displays μ) and then finally `sed -i '1,2d;s/,[^,]*$//' "$f"`. Thanks again for your help! – biohazard Dec 15 '15 at 08:02
A pipeline to a temporary file would probably be more efficient than repeatedly overwriting the destination file (`dos2unix`, then `iconv`, then `sed -i`). – tripleee Dec 15 '15 at 08:15
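
One way such a pipeline might look is sketched below. It is an illustration, not the commenter's code: tr -d '\r' stands in for dos2unix so that the example needs only standard tools (it deletes every carriage return, not just those at line ends), the input is assumed to be Shift-JIS with DOS line endings as described in the comments above, and the scratch directory and temporary file naming are made up.

```shell
dir=$(mktemp -d)
# One Shift-JIS sample file with CRLF line endings; \203\312 is the Shift-JIS
# encoding of Greek small mu, matching the <83>Ê bytes seen in vim above.
printf 'Data number of lines 540\r\nNo.,Profile,Unit\r\n1,1027.84,\203\312m\r\n2,1027.92,\203\312m\r\n' > "$dir/file0001.csv"

for f in "$dir"/*.csv; do
  tmp=$(mktemp "$f.XXXXXX") || exit 1
  # The file is read once and written once. The exit status tested here is
  # that of sed, the last command; the -s test additionally rejects an empty
  # result, e.g. when iconv could not convert the input.
  if tr -d '\r' < "$f" | iconv -f SHIFT_JIS -t UTF-8 | sed '1,2d;s/,[^,]*$//' > "$tmp" && [ -s "$tmp" ]; then
    mv "$tmp" "$f"
  else
    rm -f "$tmp"
  fi
done

result=$(cat "$dir/file0001.csv")
printf '%s\n' "$result"
```

In bash, set -o pipefail would make the if also notice failures in tr or iconv. mktemp creates the temporary file with restrictive permissions, so the replaced file may end up with different permissions than the original.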

score 0 · Answer 2 · answered Dec 15 '15 at 06:42

You may use awk,

$ awk 'NR>2{sub(/,[^,]*$/,"",$0);print}' file
1,1027.84
2,1027.92
3,1028
4,1028.81

or

sed -i '1,2d;s/,[^,]*$//' file

1,2d deletes the first two lines.
s/,[^,]*$// removes the last comma and everything after it on each remaining line.
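
Because the awk command above prints to standard output for a single file, applying it to every file in a folder needs a loop and a temporary file. The following is one possible sketch; the scratch directory, sample files, and temporary file naming are assumptions for the demonstration:

```shell
dir=$(mktemp -d)
for n in 1 2; do
  printf '%s\n' 'Data number of lines 540' 'No.,Profile,Unit' \
    '1,1027.84,µm' '2,1027.92,µm' > "$dir/file000$n.csv"
done

for f in "$dir"/*.csv; do
  tmp=$(mktemp "$f.XXXXXX") || exit 1
  # Skip the first two lines, strip the last comma-separated field, print the rest.
  if awk 'NR > 2 { sub(/,[^,]*$/, ""); print }' "$f" > "$tmp"; then
    mv "$tmp" "$f"      # replace the original only when awk succeeded
  else
    rm -f "$tmp"
  fi
done

first=$(cat "$dir/file0001.csv")
second=$(cat "$dir/file0002.csv")
printf '%s\n' "$first"
```

Running awk once per file keeps NR counting from 1 for each file, so the NR > 2 test skips the two header lines of every file.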

answered Dec 15 '15 at 06:42

Avinash Raj

