
I have a file with the following text (file1)-

1SER     CA    1   1.401   0.040   0.887
2GLN     CA    2   1.708  -0.155   1.002
3ALA     CA    3   1.870  -0.103   0.662
4GLU     CA    4   1.829   0.274   0.695

I have a separate file with similar text (file2)-

1MET     CA    1  17.704  15.987  17.370
2ARG     CA    2  17.811  16.145  17.712
3ARG     CA    3  17.634  16.267  18.034
4TYR     CA    4  17.465  16.615  18.002

My aim is to replace the characters in the range 2-4 on each line of file1 with the characters in the range 2-4 of the corresponding line of file2.

Desired output-

1MET     CA    1   1.401   0.040   0.887
2ARG     CA    2   1.708  -0.155   1.002
3ARG     CA    3   1.870  -0.103   0.662
4TYR     CA    4   1.829   0.274   0.695

i.e. the characters from 2-4 of file2 are placed in the bytes 2-4 of file1.

I know I can narrow down to the required region with cut -c 2-4 | sed ..., but I'm not able to 'read' the data from a separate file and replace it in place.

I have a feeling that it might be easier in awk, but no column-based answers please. It needs to be a solution based on a range of characters in the file (in this case 2-4).

ADDED EXAMPLE

The solution should be able to do this as well.

file1-

AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA

file2-

BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB
BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB
BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB

Output-

ABBBAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
ABBBAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
ABBBAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
Chem-man17
  • Can you add an example with lines or columns which are not supposed to be replaced? I can't tell what you're referring to with range (which value is supposed to be in the range?) – Tensibai Sep 09 '16 at 13:41
  • It is important then to [edit] to reflect what you mean with _range 2-4 in file1_ – fedorqui Sep 09 '16 at 13:46
  • @VarunM your first example is then false somewhere (the result has the same output as f2 for columns 2 to 5). It's even less clear what you're trying to achieve now... – Tensibai Sep 09 '16 at 13:50
  • I'm just trying to take the data in the bytes 2-4 from file2 and put them in the bytes 2-4 in file1. No pattern matching, no columns. – Chem-man17 Sep 09 '16 at 13:52
  • You're absolutely right, @Tensibai, I messed up. I was trying to be extra careful while writing the question out but I got careless with the important part. I'm sorry for any inconvenience! – Chem-man17 Sep 09 '16 at 13:58
  • Here we are :) Try to make the two files different enough to avoid ambiguity next time (i.e. the CA and numeric columns being the same in both files makes it harder to guess) :) – Tensibai Sep 09 '16 at 13:59
  • When you say columns, most people will assume you mean fields; but then you talk about bytes, which makes me think you mean characters (and your example seems to use characters). If you mean characters, why say "bytes" instead of "characters"? Please clean up your question to pick one term (columns, bytes, fields, or characters), and if it's "bytes", explain why "characters" doesn't work instead. – Ed Morton Sep 09 '16 at 14:01

5 Answers


If you want to replace columns, just store the data from file1 and replace it in file2:

$ awk 'FNR==NR {col1[FNR]=$1; col2[FNR]=$2; next} {$1=col1[FNR]; $2=col2[FNR]}1' f1 f2
1SER CA 1 17.704 15.987 17.370
2GLN CA 2 17.811 16.145 17.712
3ALA CA 3 17.634 16.267 18.034
4GLU CA 4 17.465 16.615 18.002

You can also store the value of the two first columns and then replace them "manually" as seen in delete a column with awk or sed:

$ awk 'FNR==NR {data[FNR]=$1 OFS $2; next} {$0=gensub(/(\s*\S+){2}/,data[FNR],1)}1' f1 f2
1SER CA    1  17.704  15.987  17.370
2GLN CA    2  17.811  16.145  17.712
3ALA CA    3  17.634  16.267  18.034
4GLU CA    4  17.465  16.615  18.002

If you just want to replace certain characters, use substr() to extract those:

$ awk -v start=2 -v len=3 'FNR==NR{data[FNR]=substr($0, start, len); next} {$0=substr($0, 1, start-1) data[FNR] substr($0, start+len)}1' f2 f1
ABBBAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
ABBBAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
ABBBAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA

That is:

awk -v start=2 -v len=3 \
   'FNR==NR{data[FNR]=substr($0, start, len); next}               # store chars start through start+len-1 of each f2 line
    {$0=substr($0, 1, start-1) data[FNR] substr($0, start+len)}   # splice them into the same range of the f1 line
    1' f2 f1                                                      # print what was created
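As a check, the same substr() splice runs on the coordinate files too. The sketch below inlines two sample lines of each file with printf so it is runnable as-is; the filenames file1/file2 are just the question's names:

```shell
# Recreate two sample lines of each file, then splice chars 2-4 of
# file2 into the same positions of file1.
printf '1SER     CA    1   1.401   0.040   0.887\n2GLN     CA    2   1.708  -0.155   1.002\n' > file1
printf '1MET     CA    1  17.704  15.987  17.370\n2ARG     CA    2  17.811  16.145  17.712\n' > file2
awk -v start=2 -v len=3 '
  FNR==NR { data[FNR] = substr($0, start, len); next }              # chars 2-4 of each file2 line
  { $0 = substr($0, 1, start-1) data[FNR] substr($0, start+len) }   # splice into the file1 line
  1' file2 file1
# 1MET     CA    1   1.401   0.040   0.887
# 2ARG     CA    2   1.708  -0.155   1.002
```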
fedorqui
  • Thanks, this solves the current problem but isn't what I was looking for. I don't want to restrict myself to column numbers. I need a solution that will work for a range, irrespective of columns. I'm editing my example to make it harder. – Chem-man17 Sep 09 '16 at 13:44

IF by "columns" and "bytes" you actually mean "characters" then:

$ cat tst.awk
BEGIN {
    split(range,r,/-/)
    repS = r[1]
    repL = r[2] - r[1] + 1
    befL = repS - 1
    aftS = repS + repL
}
NR==FNR { rep[NR] = substr($0,repS,repL); next }
{ print substr($0,1,befL) rep[FNR] substr($0,aftS) }

$ awk -v range='2-4' -f tst.awk file2 file1
1MET     CA    1   1.401   0.040   0.887
2ARG     CA    2   1.708  -0.155   1.002
3ARG     CA    3   1.870  -0.103   0.662
4TYR     CA    4   1.829   0.274   0.695
ABBBAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
ABBBAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
ABBBAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA

$ awk -v range='10-25' -f tst.awk file2 file1
1SER     CA    1  17.704   0.040   0.887
2GLN     CA    2  17.811  -0.155   1.002
3ALA     CA    3  17.634  -0.103   0.662
4GLU     CA    4  17.465   0.274   0.695
AAAAAAAAABBBBBBBBBBBBBBBBAAAAAAAAAAAAAAAAA
AAAAAAAAABBBBBBBBBBBBBBBBAAAAAAAAAAAAAAAAA
AAAAAAAAABBBBBBBBBBBBBBBBAAAAAAAAAAAAAAAAA

The above used a concatenation of your examples as the input files.
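Since the question mentions replacing "in place": awk has no in-place editing by default (GNU awk 4.1+ offers -i inplace), so the usual approach is a temp file. A runnable sketch with the same replacement logic inlined and the sample inputs created with printf (the filenames are assumptions):

```shell
# Same range-based replacement as tst.awk, writing the result back
# over file1 through a temporary file.
printf 'AAAAAAAAAA\nAAAAAAAAAA\n' > file1    # shortened sample inputs
printf 'BBBBBBBBBB\nBBBBBBBBBB\n' > file2
awk -v range='2-4' '
  BEGIN { split(range, r, /-/); repS = r[1]; repL = r[2] - r[1] + 1 }
  NR==FNR { rep[NR] = substr($0, repS, repL); next }
  { print substr($0, 1, repS-1) rep[FNR] substr($0, repS+repL) }
' file2 file1 > file1.tmp && mv file1.tmp file1
cat file1
# ABBBAAAAAA
# ABBBAAAAAA
```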

Ed Morton

Solution with paste and cut

$ paste -d '' <(cut -c1 file1) <(cut -c2-4 file2) <(cut -c5- file1)
1MET     CA    1   1.401   0.040   0.887
2ARG     CA    2   1.708  -0.155   1.002
3ARG     CA    3   1.870  -0.103   0.662
4TYR     CA    4   1.829   0.274   0.695
ABBBAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
ABBBAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
ABBBAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA

With variables:

$ s=10
$ e=25
$ paste -d '' <(cut -c1-$((s-1)) file1) <(cut -c"$s"-"$e" file2) <(cut -c$((e+1))- file1)
1SER     CA    1  17.704   0.040   0.887
2GLN     CA    2  17.811  -0.155   1.002
3ALA     CA    3  17.634  -0.103   0.662
4GLU     CA    4  17.465   0.274   0.695
AAAAAAAAABBBBBBBBBBBBBBBBAAAAAAAAAAAAAAAAA
AAAAAAAAABBBBBBBBBBBBBBBBAAAAAAAAAAAAAAAAA
AAAAAAAAABBBBBBBBBBBBBBBBAAAAAAAAAAAAAAAAA
Sundeep

This little overlong one-liner in TXR Lisp obtains two lazy lists of strings from two files, and then combines them in the desired way into one. The -t option prints the resulting value, a list of strings, as lines.

$ txr -t '[apply mapcar* (ret `@{1 [0]}@{2 [1..4]}@{1 [4..:]}`)
                 (mapcar [chain open-file get-lines] *args*)]' file1 file2
1MET     CA    1   1.401   0.040   0.887
2ARG     CA    2   1.708  -0.155   1.002
3ARG     CA    3   1.870  -0.103   0.662
4TYR     CA    4   1.829   0.274   0.695

In principle, it's a very similar approach to the solution which uses paste to combine multiple cut streams together (thanks to Bash command substitution), except all the streaming is done in one process, using data structures, and file1 is not scanned twice.

The variable *args* refers to the remaining command line arguments, as a list of strings. The inner mapcar maps these strings through the chaining of open-file and get-lines to open them as files, and obtain from each its lines as lazy lists of strings. These two lazy lists of lines are then applied to the mapcar* function, to map their elements in parallel through the anonymous function produced by (ret ...). The ret operator constructs an anonymous function whose parameters are implicitly derived from the @1 and @2 numbered parameters which occur in the body. An interpolating quasi-string is used to do the operation of selecting characters from the left and right strings. @1 would pick the entire left string. The braced syntax @{1 [0]} denotes its first character, and @{2 [1..4]} is slice extraction.

The mapcar* function is the lazy version of mapcar which is important: this prevents us from constructing the entire output in memory before printing it out. The code operates on inputs which are lazy lists, so input takes place as the mapping marches through these lists, and the marching is driven by the consumption of the output list by the -t option. That is to say, the expression instantly returns a lazy list out of mapcar*, and then as that list is printed (thanks to the -t option), the forcing of that lazy list drives the mapping which produces it, which drives the consumption of the lazy input lists, which drives the reading of the source files.

We can see what the expansion of the ret expression looks like:

$ txr -p '(macroexpand (quote (ret `@{1 [0]}@{2 [1..4]}@{1 [4..:]}`)))'
(lambda (#:arg-01-0003
         #:arg-02-0004 . #:rest-0002)
  [identity (progn #:rest-0002
              `@{#:arg-01-0003 [0]}@{#:arg-02-0004 [1..4]}@{#:arg-01-0003 [4..:]}`)])

ret has examined the contents, drilling into the interpolated quasi-string to ferret out the "meta number" parameters. It has noticed that the highest one is @2 and so a two-argument function is generated (with a rest parameter for trailing arguments). The parameters are generated as temporary symbols, and those symbols replace all the occurrences of the numbered variables.

We can prove that mapcar* is lazy, by using it to, say, multiply two infinite lists of increasing integers, and then only take the first ten squares from the result:

$ txr -p '(take 10 [mapcar* * (range 0) (range 0)])'
(0 1 4 9 16 25 36 49 64 81)
Kaz

You could give the join command a try: join the rows together, then cut away the unwanted columns.

Note that join requires the files to be sorted on the join field.
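A minimal sketch of that idea, assuming each line is a single whitespace-free field (as in the A/B example): number the lines with nl so join has a common, sorted key, then reassemble the character ranges with awk. Since join splits on whitespace, this would mangle the multi-field coordinate files; it only illustrates the suggestion.

```shell
# Pair lines by line number with nl + join, then splice chars 2-4.
# Note: with 10+ lines, zero-pad the numbers (nl -n rz) so the keys
# stay in sort order for join.
printf 'AAAAAAAAAA\nAAAAAAAAAA\n' > file1
printf 'BBBBBBBBBB\nBBBBBBBBBB\n' > file2
join <(nl -w1 file1) <(nl -w1 file2) |
  awk '{ print substr($2, 1, 1) substr($3, 2, 3) substr($2, 5) }'
# ABBBAAAAAA
# ABBBAAAAAA
```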

an.dr.eas.k