Below is a toy text file with sample and trait information, and a measurement.

Sample3_trait1  8.5
Sample6_trait2 2.2
Sample7_trait1 9.2
Sample3_trait2 1.3
Sample6_trait1 10.0
Sample7_trait2 2.1

I would like to replace the sample column with something more informative, like the actual name of the sample (say, a person's name). This would be relatively easy in sed if there were only 3 samples, e.g.

sed  's/Sample3/john.D/g' file.txt

I could do this for each "sample", but I have hundreds or thousands of sample names.

What I'd like to do is give sed a text file with two columns, the original and the replacement:

Sample3 john.D
Sample6 mary.D
Sample7 kelly.O
....
Sample1001 amy.P

And have them replaced wherever they appear throughout the file (globally), i.e., wherever Sample3 is found, replace it with john.D.

Is this something that I could do with a loop in Bash? I could loop over a single column (row by row), but I'm not sure what to do with matched columns.

Any help would be much appreciated.

LP_640
    wrt `wherever they appear` - does `Sample1` appear in the text `Sample10_trait2`? How about in `FooSample1_trait2`? If the answer to either question is no then how can we identify the delimiters for `Sample`s, e.g. does the text to be matched always occur at the start of the line and is always followed by an underscore? And no, a loop in bash is always the wrong approach for manipulating text. – Ed Morton Mar 12 '15 at 18:24

2 Answers

Use sed to convert the second file into a sed script that edits the first:

sed 's/\([^ ]*\) \(.*\)/s%\1_%\2_%/' file.2 > sed.script
sed -f sed.script file.txt
rm -f sed.script

No loops in the Bash code. Note the _ in the patterns; this is crucial to prevent Sample3 from mapping Sample300 to john.D00.
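
To make the two steps concrete, here is a minimal sketch run on a few sample lines. The scratch directory, the `Sample300` data line, and the variable names `with_underscore` and `without_underscore` are assumptions added for illustration; the two sed commands are the ones shown above.

```shell
# Scratch directory so nothing in the current directory is touched.
workdir=$(mktemp -d)

# file.2 holds the mapping (original, replacement); file.txt holds the data.
printf '%s\n' 'Sample3 john.D' 'Sample6 mary.D' 'Sample7 kelly.O' > "$workdir/file.2"
printf '%s\n' 'Sample3_trait1 8.5' 'Sample6_trait2 2.2' 'Sample300_trait1 4.0' > "$workdir/file.txt"

# Step 1: turn each mapping line into one sed substitution command.
sed 's/\([^ ]*\) \(.*\)/s%\1_%\2_%/' "$workdir/file.2" > "$workdir/sed.script"
cat "$workdir/sed.script"
# s%Sample3_%john.D_%
# s%Sample6_%mary.D_%
# s%Sample7_%kelly.O_%

# Step 2: apply the generated script. Sample300 is left alone because
# the pattern requires an underscore directly after "Sample3".
with_underscore=$(sed -f "$workdir/sed.script" "$workdir/file.txt")
printf '%s\n' "$with_underscore"

# For contrast: without the underscore, Sample300 becomes john.D00.
without_underscore=$(sed 's/Sample3/john.D/' "$workdir/file.txt")
printf '%s\n' "$without_underscore"
```

Note that the generated commands carry no `g` flag, so each replaces only the first match on a line. That is enough for this data, where the sample name appears once per line.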

If, as you should be, you are worried about interrupts and concurrent runs of the script, then (a) use mktemp to generate a file name in place of sed.script, and (b) trap interrupts etc to make sure the script file name is removed:

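tmp=$(mktemp "${TMPDIR:-/tmp}/sed.script.XXXXXX")    # unique, private script file
trap 'rm -f "$tmp"; exit 1' 0 1 2 3 13 15            # clean up on EXIT, HUP, INT, QUIT, PIPE, TERM
sed 's/\([^ ]*\) \(.*\)/s%\1_%\2_%/' file.2 > "$tmp"
sed -f "$tmp" file.txt
rm -f "$tmp"
trap 0                                               # cancel the exit trap after normal cleanup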
Jonathan Leffler

Using awk is better here:

awk -v OFS=_ 'NR==FNR{a[$1]=$2;next} $1 in a{$1=a[$1]} 1' names.txt FS=_ file.txt
john.D_trait1 8.5
mary.D_trait2 2.2
kelly.O_trait1 9.2
john.D_trait2 1.3
mary.D_trait1 10.0
kelly.O_trait2 2.1

Where names.txt is this:

Sample3 john.D
Sample6 mary.D
Sample7 kelly.O
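
For readers unfamiliar with the idiom, the one-liner above can be spread out and commented as in the sketch below. The scratch directory, the extra `Sample300` line, and the `result` variable are assumptions added for illustration; the awk program itself is unchanged.

```shell
# Scratch directory with small copies of the two input files.
workdir=$(mktemp -d)
printf '%s\n' 'Sample3 john.D' 'Sample6 mary.D' 'Sample7 kelly.O' > "$workdir/names.txt"
printf '%s\n' 'Sample3_trait1 8.5' 'Sample6_trait2 2.2' 'Sample300_trait1 4.0' > "$workdir/file.txt"

result=$(awk -v OFS=_ '
    # First file (names.txt), split on whitespace: remember original -> replacement.
    NR == FNR { a[$1] = $2; next }
    # Second file (file.txt): FS is "_" by now, so $1 is the whole sample name.
    # Assigning to $1 rebuilds the line with OFS, which restores the underscore.
    $1 in a { $1 = a[$1] }
    # Print every line, changed or not.
    1
' "$workdir/names.txt" FS=_ "$workdir/file.txt")
printf '%s\n' "$result"
```

Because the lookup is an exact match on the whole first field, `Sample300` is not affected by the `Sample3` entry. The `FS=_` placed between the file names takes effect only when awk reaches `file.txt`.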
anubhava