My file looks like this:

1-0039.1        EMBL    transcript      1       1524    .       +       .       transcript_id "1-0039.1.2"; gene_id "1-0039.1.2"; gene_name "dnaA"
1-0039.1        EMBL    CDS     1       1524    .       +       0       transcript_id "1-0039.1.2"; gene_name "dnaA";
1-0039.1        EMBL    transcript      1646    1972    .       +       .       transcript_id "1-0039.1.5"; gene_id "1-0039.1.5"; gene_name "ORF0009"

I want to change all "1-0039.1" values in the first column to 1.

So I have tried:

awk -vOFS='\t' '{$1="1"; print}' 1-0039.gtf > 1-0039_modified.gtf

And the output looks like this:

1       EMBL    transcript      1       1524    .       +       .       transcript_id   "1-0039.1.2";   gene_id "1-0039.1.2";   gene_name       "dnaA"
1       EMBL    CDS     1       1524    .       +       0       transcript_id   "1-0039.1.2";   gene_name       "dnaA";
1       EMBL    transcript      1646    1972    .       +       .       transcript_id   "1-0039.1.5";   gene_id "1-0039.1.5";   gene_name       "ORF0009"
1       EMBL    CDS     1646    1972    .       +       0       transcript_id   "1-0039.1.5";   gene_name       "ORF0009";
1       EMBL    transcript      2023    2940    .       +       .       transcript_id   "1-0039.1.7";   gene_id "1-0039.1.7";   gene_name       "ORF0586"
1       EMBL    CDS     2023    2940    .       +       0       transcript_id   "1-0039.1.7";   gene_name       "ORF0586";
1       EMBL    transcript      2897    3223    .       +       .       transcript_id   "1-0039.1.9";   gene_id "1-0039.1.9";   gene_name       "ORF0009"

As you can see, the values in the last column were space-separated, but now they are tab-separated. My question is: how do I change only the first column without messing up the other columns?

Timur Shtatland

3 Answers

With awk:

awk 'BEGIN{ FS=OFS="\t" } $1=="1-0039.1"{ $1="1" } { print }' 1-0039.gtf > 1-0039_modified.gtf

Output:

1       EMBL    transcript      1       1524    .       +       .       transcript_id "1-0039.1.2"; gene_id "1-0039.1.2"; gene_name "dnaA"
1       EMBL    CDS     1       1524    .       +       0       transcript_id "1-0039.1.2"; gene_name "dnaA";
1       EMBL    transcript      1646    1972    .       +       .       transcript_id "1-0039.1.5"; gene_id "1-0039.1.5"; gene_name "ORF0009"

See: 8 Powerful Awk Built-in Variables – FS, OFS, RS, ORS, NR, NF, FILENAME, FNR

Cyrus

Addressing OP's issue with the spaces in the last field being converted to tabs ...

As currently coded:

  • no input field delimiter is defined so all white space is treated as input field delimiters
  • what OP thinks of as the 'last field' (eg, transcript_id "1-0039.1.2"; gene_name "dnaA";) will actually be treated as 4 separate space-delimited fields
  • the output field delimiter is defined as a tab so all contiguous groups of spaces (input) will be converted to tabs (output), hence the reason OP's 'last field' (which awk actually treats as 4 separate fields) is split apart with tabs

To maintain the spaces in the 'last field' OP needs to tell awk what the input field delimiter is.
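The difference in field counts described above can be checked directly. This is a minimal sketch, assuming the file is tab-delimited with a space-separated attribute column; the sample line is built with printf for illustration rather than read from OP's file, and the variable names are hypothetical:

```shell
# One sample GTF line: 9 tab-separated columns, with spaces inside the last one.
line=$(printf '1-0039.1\tEMBL\tCDS\t1\t1524\t.\t+\t0\ttranscript_id "1-0039.1.2"; gene_name "dnaA";')

# Default FS: any run of spaces and/or tabs separates fields.
nf_default=$(printf '%s\n' "$line" | awk '{ print NF }')

# FS set to a tab: the attribute column stays a single field.
nf_tab=$(printf '%s\n' "$line" | awk -F'\t' '{ print NF }')

echo "default FS: $nf_default fields, tab FS: $nf_tab fields"
```

With the default delimiter awk sees 12 fields (8 leading columns plus the 4 pieces of the attribute column); with a tab delimiter it sees the expected 9.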

If the input field delimiter is a tab then one idea for tweaking OP's current code:

awk 'BEGIN { FS=OFS="\t"} {$1="1"; print}' 1-0039.gtf

If the input field delimiter is 2+ spaces then a couple alternatives:

awk 'BEGIN { FS="[ ]{2,}"; OFS="\t"} {$1="1"; print}' 1-0039.gtf

# or

awk 'BEGIN { FS="[ ][ ]+"; OFS="\t"} {$1="1"; print}' 1-0039.gtf

markp-fuso

awk '{sub(/^1-0039\.1/,1); print}'  1-0039.gtf > 1-0039_modified.gtf

But the sed solutions in the comments will do the same job faster.
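The comments are not reproduced here, so the following is only an assumption about what such a sed command might look like: a plain anchored substitution with the dots escaped. The sample line and variable names are hypothetical and exist only to make the sketch runnable:

```shell
# One sample line: tabs between columns, spaces inside the last column.
line=$(printf '1-0039.1\tEMBL\tCDS\t1\t1524\t.\t+\t0\ttranscript_id "1-0039.1.2"; gene_name "dnaA";')

# Replace the ID only at the start of the line; the rest of the line is untouched.
sed_out=$(printf '%s\n' "$line" | sed 's/^1-0039\.1/1/')

# Show the first and the last tab-separated column of the result.
printf '%s\n' "$sed_out" | cut -f1
printf '%s\n' "$sed_out" | cut -f9
```

On a real file the same expression would be run as sed 's/^1-0039\.1/1/' 1-0039.gtf > 1-0039_modified.gtf.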

Annotation:

Unfortunately, the question gives contradictory information:

  1. The sample shows fields separated by a varying number of spaces.
  2. You write about tabs between the fields and want to keep the spaces in the last column.

The same visual layout can also be produced by tab separation, using one tab per field at a tab width of 8 spaces.

So the solution has to deal with this conflict.

This is why my solution does not use awk's field splitting at all and only looks at the pattern at the start of the line.

That way the solution does not depend on any assumption about the delimiter in order to work properly. The delimiter can be of any type and any count, and the solution will still do the job.
In particular, it leaves the existing column delimiter(s) unchanged.
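A small check of that claim, as a sketch under the assumption of a tab-delimited sample line (built with printf here; the variable names are hypothetical). Because sub() edits the whole record and no field is assigned, awk never rebuilds the line with OFS:

```shell
# Sample line mixing tabs (between columns) and spaces (inside the last column).
line=$(printf '1-0039.1\tEMBL\tCDS\t1\t1524\t.\t+\t0\ttranscript_id "1-0039.1.2"; gene_name "dnaA";')

# sub() without a target operates on $0; dots are escaped so they match literally.
out=$(printf '%s\n' "$line" | awk '{ sub(/^1-0039\.1/, "1"); print }')

# Everything after the first column should be byte-for-byte unchanged.
expected="1${line#1-0039.1}"
[ "$out" = "$expected" ] && echo "separators preserved"
```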


Thanks for the comments below. They have a point, but keeping it simple to understand was my first thought.

So here is an alternative version that is more flexible about the content of the first column:

awk '{sub(/^1-[^ \t]*/,1); print}'  1-0039.gtf > 1-0039_modified.gtf

Because this variant stops at the first space, even one that is not meant to be a delimiter, the following version accepts a single space directly after the 1- prefix as part of the first column:

awk '{sub(/^1- ?[^ \t]*/,1); print}'   1-0039.gtf > 1-0039_modified.gtf

dodrg