Replace only alphanumeric chars from strings in one file in another

Question

file1 contains records that I want to find in file2. In each match, only the alphanumeric characters should be replaced with #, and the result should be written to file3. With the code below I'm not able to get the expected output. What am I doing wrong?

file_read=`cat file2`
while read line; do
  var=`echo $line | tr '[a-zA-Z0-9]' '#'`
  rep=`echo $file_read | awk "{gsub(/$line/,\"$var\"); print}"`
done < file1
echo file2 > file3

cat file1

2001009
@vanti Finserv Co.
2001009
Fund #1
11:11 - Capital
MS&CO(NY)
American Friends Org, Inc. 12X32
Domain-Name (LLC)
MS&CO(NY)
MS&CO(NY)
Ivy/Estate Rd
E*Trade wholesale

cat file2

<html>
<body>
<hr><br><>span class="table">Records</span><table>
<tr class="column">
 <td>Rec1</td>
 <td>Rec2</td>
 <td>Rec3</td>
 <td>Rec4</td>
 <td>Rec5</td>
 <td>Rec6</td>
 <td>Rec7</td>
 <td>Rec8</td>
</tr>
<tr class="data">
<td>@vanti Finserv Co.</td>
<td>11:11 - Capital</td>
<td>MS&CO(NY)</td>
<td>New York</td>
<td>CDX98XSD</td>
<td>E*Trade wholesale</td>
<td>Domain-Name (LLC)</td>
<td>Ivy/Estate Rd</td>
<td></td>
</tr>
<tr class="data">
<td>@vanti Finserv Co.</td>
<td></td>
<td>MS&CO(NY)</td>
<td>2</td>
<td>2</td>
<td>MS&CO(NY)</td>
<td>MS&CO(NY)</td>
<td>Ivy/Estate Rd</td>
</table>
</body>
</html>

expected output cat file3

<html>
<body>
<hr><br><>span class="table">Records</span><table>
<tr class="column">
 <td>Rec1</td>
 <td>Rec2</td>
 <td>Rec3</td>
 <td>Rec4</td>
 <td>Rec5</td>
 <td>Rec6</td>
 <td>Rec7</td>
 <td>Rec8</td>
</tr>
<tr class="data">
<td>@##### ####### ##.</td>
<td>##:## - #######</td>
<td>##&##(##)</td>
<td>New York</td>
<td>CDX98XSD</td>
<td>#*##### ########</td>
<td>######-#### (###)</td>
<td>###/###### ##</td>
<td></td>
</tr>
<tr class="data">
<td>@##### ####### ##.</td>
<td></td>
<td>##&##(##)</td>
<td>2</td>
<td>2</td>
<td>##&##(##)</td>
<td>##&##(##)</td>
<td>###/###### ##</td>
</table>
</body>
</html>

Please share what you have tried, and what errors do you hit. SO is NOT a "we will just do your task" website/community — Ron, Mar 27 '22 at 08:12
In your last question you asked to only convert the special symbols, now you want to replace alphanumeric characters but you would (if your regular expression wouldn't contain unescaped characters) actually replacing every character in your file, except for `:`, with `#`. Have a look at [your expression on regex101](https://regex101.com/r/9jJxX4/1). The errors get highlighted in red and explained. — mashuptwice, Mar 27 '22 at 08:44
What is `file_read=cat file2` supposed to mean? This sets the environment variable `file_read` to `cat`, then tries to execute `file2` as a program. Did you mean `file_read=$(cat file2)`? But you never use the variable `$file_read`. — Barmar, Mar 27 '22 at 09:21
Don't substitute shell variables directly into the `awk` script. See https://stackoverflow.com/questions/19075671/how-do-i-use-shell-variables-in-an-awk-script — Barmar, Mar 27 '22 at 09:23
You can't have spaces around the `=` in variable assignments like `var =` and `rep =` — Barmar, Mar 27 '22 at 09:24

tripleee · Answer 1 · 2022-03-27T12:19:09.727

You seem to be looking for something like

awk 'NR==FNR {
  regex = $0;
  gsub(/[][(){}|\\*+?.^$]/, "\\\\&", regex);
  a[++n] = regex;

  gsub(/[A-Za-z0-9]/, "#");
  gsub(/&/, "\\\\&");
  b[n] = $0;

  next
}
{ for(i=1;i<=n;++i)
    gsub(a[i], b[i])
} 1' file1 file2 >file3

In brief, we populate the array a with the phrases from file1 (escaped for use as regexes), and b with the corresponding replacement strings. The condition NR==FNR is true only while the first input file is being read, and next skips the rest of the script for those lines. For file2, we fall through to the rest of the script, which replaces any match of a regex from a with the corresponding string from b; the final 1 prints every line.
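As a minimal sketch of the two-file idiom on its own, the snippet below uses two small temporary files and a hypothetical array name (seen); none of these names come from the question. It only illustrates how NR==FNR separates the first file from the second.

```shell
tmp=$(mktemp -d)
printf '%s\n' 'apple' 'cherry' > "$tmp/keys"
printf '%s\n' 'apple' 'banana' 'cherry' > "$tmp/data"

# NR counts records across all inputs; FNR restarts at 1 for each file,
# so NR==FNR holds only while the first file is being read.
result=$(awk '
    NR==FNR { seen[$0] = 1; next }   # first file: remember each line
    { tag = ($0 in seen) ? "listed" : "other"; print tag ": " $0 }
' "$tmp/keys" "$tmp/data")
printf '%s\n' "$result"
rm -rf "$tmp"
```

One caveat with this idiom: if the first file is empty, NR==FNR is also true for the second file, so the first block runs on the wrong input.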

The code is complicated somewhat by the escaping of regex metacharacters in a and further by the fact that & in the replacement string needs to be escaped, too (& alone recalls the matched text).
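The two pitfalls can be seen in isolation with a few one-liners. This is only an illustrative sketch: the variable names are made up, and the bracket expressions [(] and [)] are used here merely to avoid backslash handling in -v, whereas the script above escapes with backslashes instead.

```shell
# An unescaped "(NY)" is a regex group, so only NY is matched.
grouped=$(printf '%s\n' 'MS&CO(NY)' | awk -v pat='(NY)' '{ gsub(pat, "#"); print }')
# With the parentheses made literal, the whole "(NY)" is matched.
literal_paren=$(printf '%s\n' 'MS&CO(NY)' | awk -v pat='[(]NY[)]' '{ gsub(pat, "#"); print }')

# An unescaped & in the replacement stands for the matched text.
plain=$(printf '%s\n' 'MS&CO' | awk '{ gsub(/CO/, "[&]"); print }')
# A backslash-escaped & is a literal ampersand.
literal_amp=$(printf '%s\n' 'MS&CO' | awk '{ gsub(/CO/, "\\&"); print }')

printf '%s\n' "$grouped" "$literal_paren" "$plain" "$literal_amp"
```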

Demo: https://ideone.com/YkAkAZ

You generally want to avoid while read loops in the shell; Awk is much faster and more idiomatic when you want to perform some transformation on all lines in a file.

As a further aside, please try http://shellcheck.net/ before asking for human assistance. Even after you fixed syntax errors pointed out in comments, your attempt contains common beginner errors such as broken quoting.

Thanks for your answer but this doesn't seem to help in case of records like Domain-Name (LLC) or MS&CO(NY) — Roshni, Mar 27 '22 at 10:48
Thanks for the feedback; updated with a more elaborate version with a demo. — tripleee, Mar 27 '22 at 11:35
Perhaps see also https://stackoverflow.com/questions/65538947/counting-lines-or-enumerating-line-numbers-so-i-can-loop-over-them-why-is-this - yours is not an example of that particular antipattern, but the pretzel logic in your attempt has many semblances to several related beginner approaches. — tripleee, Mar 27 '22 at 12:32

tshiono · Accepted Answer · 2022-03-27T12:30:21.613

Would you please try the following:

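awk '
    # First file: map each record to its masked form (alphanumerics become #).
    NR==FNR {s = $0; gsub("[[:alnum:]]", "#"); a[s] = $0; next}
    {
        # Second file: take the first run of text that follows a ">".
        if (match($0, ">[^<]+")) {
            str = substr($0, RSTART+1, RLENGTH-1)
            # Replace it only if it is exactly one of the records.
            if (str in a) {
                $0 = substr($0, 1, RSTART) a[str] substr($0, RSTART+RLENGTH)
            }
        }
    }
1 ' file1 file2 > file3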

It assumes each string to be replaced is enclosed in tags and is the first text after a > on its line, but it works with the shown example.
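If that assumption does not hold, one possible variant is to search for each record as a literal substring with index() and substr(), which avoids both regex escaping and the dependence on tags. The following is a rough sketch under assumptions, not part of the answer above: the file names, array names (key, mask) and sample lines are hypothetical, and [A-Za-z0-9] is used in place of [[:alnum:]] only for portability to older awk implementations.

```shell
tmp=$(mktemp -d)
printf '%s\n' 'MS&CO(NY)' 'Domain-Name (LLC)' > "$tmp/records"
printf '%s\n' '<td>MS&CO(NY)</td>' '<td>New York</td>' 'x Domain-Name (LLC) y' > "$tmp/page"

masked=$(awk '
    # First file: store each non-empty record and its masked form.
    NR==FNR {
        if ($0 != "") { key[++n] = $0; m = $0; gsub(/[A-Za-z0-9]/, "#", m); mask[n] = m }
        next
    }
    # Second file: replace every literal occurrence of each record.
    {
        for (i = 1; i <= n; i++) {
            out = ""; rest = $0
            while ((p = index(rest, key[i])) > 0) {
                out = out substr(rest, 1, p - 1) mask[i]
                rest = substr(rest, p + length(key[i]))
            }
            $0 = out rest
        }
        print
    }' "$tmp/records" "$tmp/page")
printf '%s\n' "$masked"
rm -rf "$tmp"
```

Because the mask is spliced in by plain concatenation rather than passed to gsub as a replacement, an & in a record needs no escaping. Like the regex-based approach, this would also mask a record that happens to occur inside a tag or attribute.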

edited Mar 27 '22 at 12:30

answered Mar 27 '22 at 11:51

Thanks for pointing that out. I have corrected my post. – Roshni Mar 27 '22 at 12:28
You only fixed one of the errors. – tripleee Mar 27 '22 at 12:30
