Command to remove all but select columns for each file in unix directory

Question

I have a directory with many files in it and want to edit each file to only contain a select few columns.

I have the following code which will only print the first column

for i in /directory_path/*.txt; do awk -F "\t" '{ print $1 }' "$i"; done

but if I try to edit each file in place by adding > "$i", as below, then I lose all the contents of my files

for i in /directory_path/*.txt; do awk -F "\t" '{ print $1 }' "$i" > "$i"; done

Ultimately, I want to be able to remove all but a select few columns in each file, for example keeping columns 1 and 3.

but wouldn't ```> output ``` then create a new file which I don't want — British Bioinformatician, Jun 17 '21 at 13:43
I want to edit the originals eg say I had 10 files with 5 columns. I want a solution where I still have 10 file but with only the first column — British Bioinformatician, Jun 17 '21 at 13:46
ok so no appending. Use `sed -E -i.bak 's/^([^[:blank:]]+).*/\1/' /directory_path/*.txt` — anubhava, Jun 17 '21 at 13:47
"Append" means to add on to the end. You appear to be trying to *remove* all but the first column from each file. — chepner, Jun 17 '21 at 14:08

dawg · Accepted Answer · 2021-06-17T14:40:53.377

5

Given:

cat file
1 2 3
4 5 6

You can do in place editing with sed:

sed -i.bak -E 's/^([^[:space:]]*).*/\1/' file 

cat file
1
4

If you want freedom to work with multiple columns and have in place editing, use GNU awk that also supports in place editing:

gawk -i inplace '{print $1, $3}' file

cat file 
1 3
4 6
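
Since the files in the question are tab-separated, here is a minimal sketch of the same idea applied to several files at once. It assumes GNU awk 4.1 or later (the release that added the inplace extension); demo_dir and the file names are made up for illustration:

```shell
# Throwaway demo data; demo_dir and the file names are placeholders
demo_dir=$(mktemp -d)
printf '1\t2\t3\n4\t5\t6\n' > "$demo_dir/a.txt"
printf '7\t8\t9\n' > "$demo_dir/b.txt"

if command -v gawk >/dev/null 2>&1; then
    # -F sets the input separator, OFS the output separator (both tab here)
    gawk -i inplace -F '\t' -v OFS='\t' '{ print $1, $3 }' "$demo_dir"/*.txt
    cat "$demo_dir/a.txt"
else
    echo "gawk not found; write to a temp file and rename it instead" >&2
fi
```

If gawk prints its usage text instead of editing the files, the installed version may predate 4.1 and not know the -i option.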

If you only have POSIX awk, or want to use cut, you generally do this:

1. Modify the file with awk, cut, sed, etc.
2. Redirect the output to a temp file.
3. Rename the temp file back to the original file name.

Like so:

awk '{print $1, $3}' file >tmp_file && mv tmp_file file

Or with cut:

cut -d ' ' -f 1,3 file >tmp_file && mv tmp_file file

To do a loop on files in a directory, you would do:

for fn in /directory_path/*.txt; do
    # only replace the original if awk succeeded
    awk -F '\t' '{ print $1 }' "$fn" >tmp_file &&
    mv tmp_file "$fn"
done
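
For the case in the question (tab-separated files, keep columns 1 and 3), a slightly more defensive sketch of the same loop is below. It uses a fresh mktemp file per input and keeps a .bak copy of each original; demo_dir and the sample files are made up so the snippet runs on its own, and in practice you would loop over /directory_path/*.txt instead:

```shell
# Throwaway demo data; demo_dir and the file names are placeholders
demo_dir=$(mktemp -d)
printf 'a\tb\tc\td\n1\t2\t3\t4\n' > "$demo_dir/one.txt"
printf 'e\tf\tg\n5\t6\t7\n' > "$demo_dir/two.txt"

for fn in "$demo_dir"/*.txt; do
    tmp_file=$(mktemp) || continue
    # Keep columns 1 and 3; OFS keeps the output tab-separated
    if awk -F '\t' -v OFS='\t' '{ print $1, $3 }' "$fn" > "$tmp_file"; then
        cp -p "$fn" "$fn.bak"    # optional backup of the original
        mv "$tmp_file" "$fn"     # replace the original only on success
    else
        rm -f "$tmp_file"
    fi
done

cat "$demo_dir/one.txt"
```

One thing to be aware of: mktemp creates its file with restrictive permissions, so after the mv the edited file may not have the same permissions as the original.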

edited Jun 17 '21 at 14:40

answered Jun 17 '21 at 13:50

dawg


Okay thanks, I'm not experienced with sed command, how would I keep say 1 and 3 – British Bioinformatician Jun 17 '21 at 13:54
Keep 1 and 3 and then what with 4,5,6? – dawg Jun 17 '21 at 13:55
well in your example you have kept 1, how would I also keep column 3 in the same command – British Bioinformatician Jun 17 '21 at 13:58
when I use the gawk command I get a rather long list of what the GAWK does but doesn't actually do anything. – British Bioinformatician Jun 17 '21 at 14:08
I see how creating a temp and then renaming can be done, not sure how I could apply this to every file in my directory however – British Bioinformatician Jun 17 '21 at 14:12
See example. You only need to be using one tmp file for a group of files since it is overwritten every time – dawg Jun 17 '21 at 14:17
And if you want to have a backup of each file, that is easily done as well. 1) add the extension to the file name in the loop 2) copy the original file to the backup; 3) redirect the output of awk to the tmp file; 4) mv tmp file to original. – dawg Jun 17 '21 at 14:38
Bit of a weird one. When I use ```awk -F "\t" 'NF != 7' file ``` to check how many columns are in my file, every line prints out, whereas ```awk -F "\t" 'NF != 1' file ``` prints nothing, meaning awk thinks the file has one column. My file originally had something like 15 columns and should now have 7, but it is treated as having only 1 column. Any ideas? – British Bioinformatician Jun 17 '21 at 15:43
Is the file truly separated by tabs or spaces? Try NOT setting `-F` so that awk splits on any whitespace. – dawg Jun 17 '21 at 16:18
yes it is, if I remove the '\t' then files are wiped blank – British Bioinformatician Jun 17 '21 at 16:21
Ask a new question and post example data – dawg Jun 17 '21 at 16:25
I've used cut instead of print and this has worked. Thanks – British Bioinformatician Jun 17 '21 at 16:30

score 0 · Answer 2 · answered Oct 07 '21 at 14:17

Just to add a little to @dawg's working answer, based on my use case.

I was dealing with CSVs, and standard CSV allows a , inside a value as long as the value is in double quotes. For example, the row below is a valid CSV row.

col1,col2,col3

1,abc,"abc, inc"

But the command above was treating the , between the double quotes as a delimiter too.
Also, the output file delimiter wasn't specified in the command.

These are the modifications I had to make for it to handle the above two problems:

for fn in /home/ubuntu/dir/*.csv; do
    # FPAT is a GNU awk extension; set it in BEGIN so it applies from the first line
    gawk 'BEGIN { FPAT = "([^,]*)|(\"[^\"]+\")"; OFS = "," } { print $1, $2 }' "$fn" >tmp_file &&
    mv tmp_file "$fn"
done

OFS sets the delimiter used in the output/result file.
FPAT handles the case of a , between quotation marks.

The regex and the explanation for it are in the GNU awk manual, section 4.7, "Defining Fields by Content".

I was led to that solution through this answer.
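
As a quick check of how FPAT splits the sample row, here is a small sketch. It assumes GNU awk is installed as gawk (FPAT is not available in other awks); the sample and result variable names are just for illustration:

```shell
# The data row from above; its third field contains a comma inside quotes
sample='1,abc,"abc, inc"'

if command -v gawk >/dev/null 2>&1; then
    # FPAT describes what a field looks like instead of what separates fields
    result=$(printf '%s\n' "$sample" |
        gawk 'BEGIN { FPAT = "([^,]*)|(\"[^\"]+\")"; OFS = "," } { print $1, $3 }')
    printf '%s\n' "$result"
else
    echo "gawk not found; FPAT needs GNU awk" >&2
fi
```

The quoted value stays together as one field, quotes included, rather than being cut at the inner comma.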
