
System: Linux. Bash 4.

I have the following file, which will be read into a script as a variable:

/path/sample_A.bam A 1
/path/sample_B.bam B 1
/path/sample_C1.bam C 1
/path/sample_C2.bam C 2 

I want to append "_string" to the end of the filename in the first column, but before the extension (.bam). It's a bit trickier because the name begins with the path.

Desired output:

/path/sample_A_string.bam A 1
/path/sample_B_string.bam B 1
/path/sample_C1_string.bam C 1
/path/sample_C2_string.bam C 2 

My attempt: I wrote the following script (I ran: bash script.sh):

List=${1};
awk -F'\t' -vOFS='\t' '{ $1 = "${1%.bam}" "_string.bam" }1' < ${List} ;

And its output was:

${1%.bam}_string.bam
${1%.bam}_string.bam
${1%.bam}_string.bam
${1%.bam}_string.bam

Problem: I followed the idea of using awk for this substitution, as in this thread https://unix.stackexchange.com/questions/148114/how-to-add-words-to-an-existing-column , but the parameter expansion ${1%.bam} is clearly not being recognised by AWK as I intended. Does someone know the correct syntax for that part of the code? It was meant to mean "the whole first entry of the first column, except the trailing .bam". I used ${1%.bam} because it works in Bash, but AWK is another language and its syntax probably differs. Thank you!

Inian
msimmer92
  • You need to have imported `$1` in the context of `Awk`. Your attempt does not work because `awk` does not recognize `$1` as a place-holder for the content stored. – Inian Jan 29 '19 at 13:53
  • $1 is not meant to be calling a variable named "$1", but rather it is supposed to be the first column of the file (given that the file has rows and columns). The problem is different here and that thread didn't solve it. – msimmer92 Jan 29 '19 at 13:57

4 Answers

3

Note that the parameter expansion you applied to $1 won't take effect inside awk, because the entire awk program is passed in single quotes ('..'), which the shell hands over literally without any expansion. Hence the string "${1%.bam}" is passed as-is and assigned to the first column.

You can do this completely in Awk

awk -F'\t' 'BEGIN { OFS = FS }{ n=split($1, arr, "."); $1 = arr[1]"_string."arr[2] }1'  file

The code splits the content of $1 on the delimiter . into an array arr within Awk. The part of the string up to the first . is stored in arr[1], and the remaining pieces are stored in the following array indices. The filename is then rebuilt by concatenating arr[1], the string _string. and arr[2]. Note that this assumes $1 contains exactly one . (the one before the extension).
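As a rough awk counterpart to Bash's ${1%.bam}, one option is to let sub() edit $1 in place with a regex anchored at the end of the field. This is only a sketch: the temporary file and the variable names list and result are made up for illustration, and the input is assumed to be tab-separated as in the question's script.

```shell
# Hypothetical sample input (tab-separated), written to a temporary file.
list=$(mktemp)
printf '/path/sample_A.bam\tA\t1\n/path/sample_C2.bam\tC\t2\n' > "$list"

# sub() modifies $1 in place. The regex /\.bam$/ matches a literal ".bam"
# only at the end of the first field, so dots elsewhere in the path are
# left alone.
result=$(awk -F'\t' -v OFS='\t' '{ sub(/\.bam$/, "_string.bam", $1) } 1' "$list")

printf '%s\n' "$result"
rm -f "$list"
```

Unlike the split() approach, this does not depend on how many dots appear in the path.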

Inian
  • This also gives me the output I wanted. I selected the other answer as the main answer because it was the first one (and also the syntax is easier to understand and more straightforward, in my opinion). But which one is more correct or better? In case this one is, I will switch it. – msimmer92 Jan 29 '19 at 14:10
  • 1
    @msimmer92: I'll leave it up to you to pick the one you find most useful. At the end of the day, take the one you find easiest to work with or to adapt in your further coding efforts – Inian Jan 29 '19 at 14:13
2

If I understood your requirement correctly, please try the following.

val="_string"
awk -v value="$val" '{sub(".bam",value"&")} 1'  Input_file

Brief explanation: -v value="$val" passes the value of the shell variable val to the awk variable value. The sub function then substitutes the string .bam with the contents of value followed by the matched text, which & denotes. The final 1 prints each line, edited or not.

Why OP's attempt didn't work: Dear OP, in awk we can't use shell variables directly; they have to be passed in explicitly (for example with -v). So what you tried is not treated as an awk variable; it is taken as a string literal and printed as-is. My explanation above shows how to pass shell variables to awk.

NOTE: In case you have multiple occurrences of .bam on a line, change sub to gsub in the above code. Also, in case your Input_file is TAB-delimited, use awk -F'\t' in the above code.
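One caveat worth illustrating: when the pattern is given as the string ".bam", the dot is a regex wildcard, so it can also match text such as "_bam" earlier in the path. A possible variant, sketched below, escapes the dot, anchors the match at the end of the field and restricts it to $1. The directory name my_bam_files and the variable names input and result are hypothetical, chosen only to trigger the problem.

```shell
val="_string"

# Hypothetical input whose path contains "_bam" before the real extension.
input=$(mktemp)
printf '/path/my_bam_files/sample_A.bam\tA\t1\n' > "$input"

# /\.bam$/ matches a literal ".bam" at the end of $1 only; "&" in the
# replacement stands for the matched text, so value is inserted before it.
result=$(awk -F'\t' -v OFS='\t' -v value="$val" \
    '{ sub(/\.bam$/, value "&", $1) } 1' "$input")

printf '%s\n' "$result"
rm -f "$input"
```

This assumes value contains no & or backslash, since sub() treats those specially in the replacement text.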

RavinderSingh13
  • @RavinderSingh13 How would you solve this with a regular expression? (in the case that instead of ".bam", you want to substitute any extension) (imagine you apply that to a list of text files and some are .bam and some are .fastq) – msimmer92 Feb 10 '19 at 15:11
  • @msimmer92, I haven't checked it but can you please try replacing `sub(".bam",value"&")` to `gsub(/.bam|.fastq/,value"&")` and let me know then? – RavinderSingh13 Feb 11 '19 at 03:43
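For the "any extension" case raised in the comments, one possible approach is to match the final dot plus the alphanumeric characters after it at the end of the first field, instead of listing each extension. This is a sketch under assumptions: the input is tab-separated, the extension consists of letters and digits only, and the file and variable names are invented for the example.

```shell
# Hypothetical mixed input: one .bam and one .fastq entry.
input=$(mktemp)
printf '/path/sample_A.bam\tA\t1\n/path/sample_B.fastq\tB\t1\n' > "$input"

# /\.[A-Za-z0-9]+$/ matches the last ".ext" of $1; "&" re-inserts the
# matched extension after the suffix held in value.
result=$(awk -F'\t' -v OFS='\t' -v value="_string" \
    '{ sub(/\.[A-Za-z0-9]+$/, value "&", $1) } 1' "$input")

printf '%s\n' "$result"
rm -f "$input"
```

With a double extension such as .fastq.gz, only the last part counts as the extension, so the suffix would land before .gz.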
2
sed -i 's/\.bam/_string\.bam/g' myfile.txt

It's a single line with sed. Just replace the .bam with _string.bam

jasonmclose
  • It also works, but you should maybe edit and clarify that the -i edits the original file. In my case I didn't want to do that (I want to redirect that output to another file, so I have both), so I took out the -i. – msimmer92 Jan 30 '19 at 12:28
  • 1
    Ah. Ok. That's fine. Just cat the file, pipe to the sed, and direct to a new file. – jasonmclose Jan 30 '19 at 15:18
  • How would you solve this with a regular expression? (in the case that instead of ".bam", you want to substitute any extension) (imagine you apply that to a list of text files and some are .bam and some are .fastq) I mean, which would be the correct regular expression to use? – msimmer92 Feb 10 '19 at 15:18
  • Well, I don't quite understand the question. But if you want to match multiple things, you can always just add in more expressions with sed. So if I understand your question correctly, you can do this: `cat myfile.txt | sed -e 's/\.bam/_string\.bam/g' -e 's/\.fastq/_string\.fastq/g'` – jasonmclose Feb 11 '19 at 13:11
  • If you don't mind changing any and all file extensions in a list of text files, and the filename/extension are always the last characters of the line, you can do this: `cat myfile.txt | sed 's/\(\.[a-zA-Z]\+$\)/_string\1/g'` – jasonmclose Feb 11 '19 at 13:13
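To keep the original list untouched, as discussed in the comments, sed can read the file directly and write its output to a new file, with no -i and no cat. A minimal sketch, in which the temporary directory and the output name myfile_string.txt are illustrative assumptions:

```shell
# Work in a throwaway directory so nothing real is modified.
workdir=$(mktemp -d)
printf '/path/sample_A.bam\tA\t1\n' > "$workdir/myfile.txt"

# Without -i, sed prints the edited text to stdout; redirect it to a new file.
sed 's/\.bam/_string.bam/' "$workdir/myfile.txt" > "$workdir/myfile_string.txt"

original=$(cat "$workdir/myfile.txt")       # unchanged input
result=$(cat "$workdir/myfile_string.txt")  # edited copy

printf '%s\n' "$result"
rm -rf "$workdir"
```

Dropping the g flag limits the substitution to the first .bam on each line, which is enough when the filename is in the first column.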
1

You can try it this way with awk:

awk -v a='_string' 'BEGIN{FS=OFS="."}{$1=$1 a}1' infile
ctac_