AWK: Reading all lines & manipulating one file ENTIRELY based on each line of another file

Question

I have two input files:

File1.txt:

Name    Latin-small    Roman        Latin-caps #header, not to be processed
F0,        a,              I,            A
F1,        b,              II,           B
F2,        c,              III,          C
F3,        d,              IV,           D

File2.txt:

Lorem ipsum
Roman here.
LCaps here.
LSmall here.
Lorem ipsum

I assign the values of R, LC and LS from each line of File1.txt (line 6 of script.sh).
Folders named Fx, where x = 0, 1, 2, 3, ..., are generated using File1.txt (line 7 of script.sh).
Individual files named Fx.txt, generated using File2.txt, have to be placed in those folders (line 7 of script.sh).
Now, after reading a single line of File1.txt, it should read (line 7 of script.sh) and modify the whole of File2.txt looking at the keys. <- This is where I cannot make it work: it reads only one line of File2.txt for each line of File1.txt.
The contents of those files are basically copies of File2.txt, except that each occurrence of here is replaced according to the keys Roman ($3 of File1.txt), LCaps ($4 of File1.txt) and LSmall ($2 of File1.txt) for each Fx.txt in each directory, using the values assigned in the first step from File1.txt (lines 9-17 of script.sh).

How to get the following output in respective folders (e.g. the output file in Folder F2), using awk:

cat F0/F0.txt

 Lorem ipsum
 Roman               I.
 LCaps           A.
 LSmall         a.
 Lorem ipsum

or,

cat F3/F3.txt

 Lorem ipsum
 Roman               IV.
 LCaps           D.
 LSmall         d.
 Lorem ipsum

or,

cat F2/F2.txt

 Lorem ipsum
 Roman III.
 LCaps C.
 LSmall c.
 Lorem ipsum

More info: File1 is ~300 lines; for each line (except the header), one file is to be created in each folder. File2 is ~200 lines. Each of the phrases Roman, LSmall or LCaps occurs at random in certain lines of File2.txt, but never more than one per line. These are the keys for modifying values in `

Thanks in advance! This question is a part of a bigger workflow.

EDIT2: trial code

script.sh

awk 'BEGIN {FS=","}
 {
  if ($1 !~ "F")
    {}
  else if ($1 ~ "F")
    {LS = $2; R = $3; LC = $4;
    system("mkdir "$1); filename=$1"/"$1".txt";
    {(getline < "File2.txt");
      {
        if ($0 ~ "Roman")
          {gsub("here",R); print >> filename;}
        else if ($0 ~ "LSmall")
          {gsub("here",LS); print >> filename;}
        else if ($0 ~ "LCaps")
          {gsub("here",LC); print >> filename;}
        else
          {print >> filename;}
      }
    }
    }
  }
' File1.txt

I'm getting folder and file structure as I need (file Fx.txt in Folder Fx, where x = 0, 1, 2, ...), but content of these files are:

cat F0/F0.txt

Lorem ipsum

cat F1/F1.txt

Roman               II.

cat F2/F2.txt

LCaps           C.

cat F3/F3.txt

LSmall         d.

The key is to make awk read the entire file2.txt, while reading each line of file1 and making modifications and placing the new files in respective folders.

The goal is somewhat similar to [https://stackoverflow.com/questions/57234705/reading-two-files-by-comparing-all-lines-of-file-2-with-each-line-of-file-1] (reading two files by comparing all lines of file 2 with each line of file 1), but here it needs to be done by AWK. — massisenergy, Dec 30 '19 at 12:35
I don't understand how the output is generated and where should it be located. Why `file2`? Why folder `F2` not `F0`? Why `Greek III`, why `III` and not `I`? — KamilCuk, Dec 30 '19 at 12:50
@https://stackoverflow.com/users/1765658/f-hauri, added the output I'm getting with my code. Corrected the mistakes in words. — massisenergy, Dec 30 '19 at 13:28
@https://stackoverflow.com/users/9072753/kamilcuk, I added more details, hoping it would clarify. It's bit complex too, to convey properly from my side. Please feel free to ask for more information. — massisenergy, Dec 30 '19 at 13:28
Just @user pings a user, there is no need to copy the profile URL and in fact that probably prevents the ping from working. But there is probably no need to ping these users now anyway. — tripleee, Dec 30 '19 at 13:36
So in other words you have two keys which together select what to replace "here" with. Your attempt doesn't seem to make any attempt at reading the mapping file but this should not be hard in Awk. (Not trivial because there is no way to make an array point to another array, but entirely doable.) — tripleee, Dec 30 '19 at 13:40
@triplee, thanks. Yes, agree with "So in other words you have two keys which together select what to replace "here" with.", but didn't get "doesn't seem to make any attempt at reading the mapping file" — massisenergy, Dec 30 '19 at 13:44
Your code looks for the key anywhere on a line but the examples all have the key in the first field. Is the latter a correct observation or should an implementation attempt to find the key anywhere? The headers don't exactly correspond to the actual keys -- is this intentional? If you have more than these few keys, how would we map from the header line to an actual key? — tripleee, Dec 30 '19 at 13:46
Your code reads a single file, but your problem statement has two input files. But I see now that you have hard-coded the second file into the Awk script. My bad. — tripleee, Dec 30 '19 at 13:46
@triplee, Not exactly. In `script.sh`, the first 7lines takes input from and operates on `file1.txt` (and does two more things - 1. creates a folder namesake to column one or $1, 2. assigns some values which are to be replaced at particular lines in `file2.txt`), then line 8-21 works on the other file using `getline < "File2.txt"`. The only problem is for each line read in `file1.txt`, it reads only one line from `file2.txt`, whereas I want it to read the entire `file2.txt`. Do you see that reflected, if you check my output? — massisenergy, Dec 30 '19 at 13:54
Once this gets reopened we can discuss how to do that properly (maybe google `NR==FNR`). In the meantime, can you please clarify whether the token is always the first word on each line or a pattern which should be matched anywhere? — tripleee, Dec 30 '19 at 13:59
Okay, thanks. Getting some hope for a question which is hard for me to make others understand. The first word (or column) of `file1.txt` is the most important piece of information, as it 1. creates the folders where the new files (`Fx.txt`) are to be placed. 2. then, from the other columns in the same line, provides the values, which are to be placed at particular lines of `Fx.txt`, the key for which are in the phrases such as `LSmall` (see line 12-13 of `script.sh`), that exist in random places of `file2.txt`. Does it clarify your query? — massisenergy, Dec 30 '19 at 14:10
"Random places" sounds like "anywhere on a line" then. Still, do we need to hard-code a mapping e.g. from the heading line identifier *Latin-small* to the apparently corresponding pattern `LSmall`? And could you please [edit] the question to provide these details so that they are not hidden down here in the comments? — tripleee, Dec 30 '19 at 15:12

tripleee · Accepted Answer · 2019-12-31T06:45:00.970

Like you discovered, Awk can really only process one line at a time. But we can turn things around and read the input file into memory, then loop over its lines repeatedly as we read the other file.

Your example has a comma and a space between the items in file1.txt but I assumed this is not a hard requirement, and so this script expects tab-delimited input instead.
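If your real file1.txt does use the comma-plus-spaces layout from the question, one way to get the tab-delimited form the script expects is a one-off conversion with Awk itself. This is only a sketch: the names demo_file1.txt and demo_file1.tsv are hypothetical, and it assumes no field contains a comma of its own.

```shell
# Sample input in the question's comma-plus-spaces layout (hypothetical file name).
cat > demo_file1.txt <<'EOF'
Name    Latin-small    Roman        Latin-caps
F0,        a,              I,            A
F1,        b,              II,           B
EOF

# Split on a comma followed by any number of spaces. Assigning $1 = $1 forces
# awk to rebuild the record with OFS (a tab) between the fields; the trailing
# 1 prints every record. The header has no comma, so it passes through as is.
awk -F ', *' -v OFS='\t' '{ $1 = $1 } 1' demo_file1.txt > demo_file1.tsv
```

The converted copy can then be passed to the script below in place of file1.txt.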

awk -F "\t" 'BEGIN { split(":LSmall:Roman:LCaps", k, /:/) }
    NR==FNR { a[NR] = $0; n=NR; next }  # first file: store every line
    FNR==1 { next }  # skip header
    {
        system("mkdir "$1)
        filename=$1"/"$1".txt"
        for (i=1; i<=n; i++) {
            line = a[i]
            for (j=2; j<=NF; ++j) {
                if (line ~ k[j]) {
                    gsub(/here/, $j, line)
                    break
                }
            }
            print line >>filename
        }
        close(filename)  # release the handle before the next row
    }' file2.txt file1.txt

The BEGIN block initializes an array k of substitution key names. To keep it in sync with the fields in file1.txt, the first item k[1] is empty (it doesn't specify a substitution key).
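To see why the leading colon matters, here is a small standalone check of what split() puts into k. The output file name split_demo.out is just a hypothetical scratch file.

```shell
# split() returns the number of pieces; the leading ":" yields an empty
# first piece, so k[2] lines up with field $2 of file1.txt, k[3] with $3,
# and k[4] with $4.
awk 'BEGIN {
    n = split(":LSmall:Roman:LCaps", k, /:/)
    for (i = 1; i <= n; i++)
        printf "k[%d]=\"%s\"\n", i, k[i]
}' > split_demo.out

cat split_demo.out
# k[1]=""
# k[2]="LSmall"
# k[3]="Roman"
# k[4]="LCaps"
```

This is also why the inner loop in the script starts at j=2: an empty pattern would match every line.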

When NR==FNR we are reading the first input file. We simply collect its lines into the array a.
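If the NR==FNR idiom is new to you, this tiny experiment may help. NR counts records across all input files, while FNR restarts at 1 for each file, so the two are equal only while the first file is being read. The file names here are hypothetical scratch files.

```shell
# Two small throwaway inputs.
printf 'x\ny\n' > nr_a.txt
printf 'p\nq\nr\n' > nr_b.txt

# Print both counters for every record of both files.
awk '{ print FILENAME, NR, FNR, (NR == FNR ? "first" : "later") }' \
    nr_a.txt nr_b.txt > nr_demo.out

cat nr_demo.out
# nr_a.txt 1 1 first
# nr_a.txt 2 2 first
# nr_b.txt 3 1 later
# nr_b.txt 4 2 later
# nr_b.txt 5 3 later
```

One caveat: if the first file is empty, NR==FNR is also true for the second file, so the idiom silently misbehaves in that case.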

When we fall through, we are reading the second file, which is the mapping with directory names and substitutions. For each input line, we loop over all the lines in a, perform the substitution selected by the first key found on the line, and finally print the result to the specified output file. We stop at the first matching key; change this if you want multiple keys to be able to trigger on the same line.

You'll notice how we pull the first field and loop over the subsequent fields, looking up their corresponding key in k by index.

Demo: https://ideone.com/syTv99
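For comparison, the getline approach from the question can probably be made to work too. A single getline call reads just one line and remembers its position in the file, which is why each Fx.txt ended up with one line. Looping until getline stops returning 1, then calling close() so the next row starts from the top again, is one way around that. This is a different technique from the script above, not a replacement for it; the scratch directory getline_demo and the shortened sample data are assumptions made for the demo, and file1.txt is tab-delimited as before.

```shell
mkdir -p getline_demo
(
    cd getline_demo || exit 1
    # Shortened, tab-delimited sample inputs (assumed for this demo).
    printf 'Name\tLatin-small\tRoman\tLatin-caps\nF0\ta\tI\tA\nF1\tb\tII\tB\n' > file1.txt
    printf 'Lorem ipsum\nRoman here.\nLCaps here.\nLSmall here.\nLorem ipsum\n' > file2.txt

    awk -F '\t' '
        FNR == 1 { next }                       # skip the header row
        {
            system("mkdir -p " $1)
            out = $1 "/" $1 ".txt"
            # getline returns 1 per line read, 0 at end of file, -1 on error.
            # Reading into "line" leaves $0 and the fields of file1.txt intact.
            while ((getline line < "file2.txt") > 0) {
                if      (line ~ /LSmall/) gsub(/here/, $2, line)
                else if (line ~ /Roman/)  gsub(/here/, $3, line)
                else if (line ~ /LCaps/)  gsub(/here/, $4, line)
                print line > out
            }
            close("file2.txt")   # rewind: the next row rereads from line 1
            close(out)
        }' file1.txt
)
```

The trade-off is that file2.txt is reread from disk once per row of file1.txt, whereas the array version reads it once. Note also that a replacement value containing & or a backslash would be treated specially by gsub in either version.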

If you want to do this on hundreds of files, perhaps refactor some or all of the surrounding loop out into a shell script and concentrate on the substitution actions in the Awk script. The shell can easily loop over the data in file1.txt just as well, which will simplify the Awk script somewhat and make the overall process easier to understand.

# Trim the obnoxious header
tail -n +2 file1.txt |
while read -r directory LSmall Roman LCaps; do
    mkdir "$directory"
    awk -v LSmall="$LSmall" -v Roman="$Roman" -v LCaps="$LCaps" '
        BEGIN { split("LSmall:Roman:LCaps", k, /:/)
            split(LSmall ":" Roman ":" LCaps, r, /:/) }
        {
            for (j=1; j<=3; ++j)
                if ($0 ~ k[j]) {
                    gsub(/here/, r[j])
                    break
                }
        }1' file2.txt >"$directory"/"$directory".txt
done

Demo: https://ideone.com/RUhsUS
