
I have a file containing a list of regular expressions and literal replacement strings in the following format:

OLD_REGEXP_1 NEW_STRING_1
OLD_REGEXP_2 NEW_STRING_2
...

I want to replace all of the strings that match OLD_REGEXP_X with NEW_STRING_X in multiple *.txt files.

I believe that this is a common question and someone should have already done something similar before, but I just couldn't find an existing solution written in bash.

For example, given this mapping file:

Tom Thompson
Billy Bill&Ted
goog1e\.com google.com
https?://www\.google\.com https://google.com

Input:

Tom and Billy are visiting http://www.goog1e.com

Expected output:

Thompson and Bill&Ted are visiting https://google.com

The major challenges are:

  • The strings to be replaced are described by POSIX Extended Regular Expressions, not literal, and any character that is not a POSIX ERE metacharacter, including / which is often used as a regexp delimiter by some tools, must be treated as literal.
  • The replacement strings are literal and can contain any literal character, including chars like & and \1 that are often used as backreference metacharacters in replacement strings but must be literal in this case.
  • Replacements must occur in the order they appear in the mapping file so if we have A->B and B->C in that order in the mapping file and A appears in the text file that is to be changed, then the output will contain "C" in place of "A", not "B".
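
To make the second and third points concrete, here is a minimal sketch using plain sed on hand-written rules. It only illustrates the pitfalls and is not a solution to the question. The variable names are chosen for this illustration.

```shell
# "&" in a sed replacement stands for the matched text, so it must be
# escaped with a backslash to be taken literally.
unescaped=$(printf 'Billy\n' | sed 's/Billy/Bill&Ted/')
escaped=$(printf 'Billy\n' | sed 's/Billy/Bill\&Ted/')
printf '%s\n' "$unescaped"   # BillBillyTed
printf '%s\n' "$escaped"     # Bill&Ted

# Rules applied in order cascade: A becomes B, and that B then becomes C.
cascade=$(printf 'A\n' | sed -e 's/A/B/g' -e 's/B/C/g')
printf '%s\n' "$cascade"     # C
```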
Ed Morton
kit
  • Is it possible that old and new strings may also contain special characters like `*`, `+`, `[`, `]`, `(`, `)`, `&` etc? – anubhava Jul 31 '18 at 08:33
  • @anubhava, yes, the old strings may also contain `?`, `!` – kit Jul 31 '18 at 08:37
  • So are they regular expressions, or literal strings? In the latter case, you need to backslash or otherwise neutralize `*`, `[` etc (but not in particular `!` or, depending on the `sed` dialect, necessarily even `?`) – tripleee Jul 31 '18 at 08:40
  • I think we can always treat the old string as regular expression. But I am not so sure how `/` works in `sed` or `awk`, so I am escaping them anyway. – kit Jul 31 '18 at 11:56
  • Every occurrence of the old string should be replaced by the new string; if a new string matches another old string later, it should be handled in the same way. – kit Jul 31 '18 at 12:03
  • Fortunately, all of the new strings are just strings, no matching groups from the old string – kit Jul 31 '18 at 12:05

2 Answers


You can convert your substitution list file into a sed script file, then let sed do the job for you.

Give this a try with GNU sed:

sed -i -f <(sed -r 's/^(\S*) (.*)/s@\1@\2@/g' listfile) *.txt
Kent
  • I am using macOS, it says `sed: illegal option -- r` – kit Jul 31 '18 at 08:19
  • change to `-E` might help. – CWLiu Jul 31 '18 at 08:29
  • Can you explain `<(sed -r 's/^(\S*) (.*)/s@\1@\2@/g' listfile)`? I would like to know how it works. – kit Jul 31 '18 at 08:31
  • `\S` isn't properly portable either, though it appears to work in Mac OS `sed`. – tripleee Jul 31 '18 at 08:31
  • @kit run that (without the `<(` and final `)` decorations) to see what it does. It generates a `sed` script for the outer `sed -f` – tripleee Jul 31 '18 at 08:32
  • I tried to run `sed -E 's/^(\S*) (.*)/s@\1@\2@/g' listfile`, it prints the content of `listfile`. – kit Jul 31 '18 at 08:34
  • Not exactly. It prints `s@OLD_STRING_1@NEW_STRING_1@`, `s@OLD_STRING_2@NEW_STRING_2@` etc – tripleee Jul 31 '18 at 08:38
  • I changed it to `'s/^(.*) (.*)/s@\1@\2@/g'`, then I got something like `s@OLD_STRING_1@NEW_STRING_1@` now. – kit Jul 31 '18 at 08:43
  • So basically, I convert the `listfile` into a sed script file, a list of `s@OLD_STRING_1@NEW_STRING_1@`, then use it with `-f` option and input `*.txt` with `-i` option, is it how the command works? – kit Jul 31 '18 at 08:47
  • I saved the output from `sed -E 's/^(.*) (.*)/s@\1@\2@/g' listfile` into a file `sed`, but when I run `sed -i -f <(cat sed) *.txt`, I got error `sed: 1: "/dev/fd/63": invalid command code f`. – kit Jul 31 '18 at 09:02
  • @kit you should read some info/man page of sed, no matter gnu sed or bsd sed. also, read something to understand "process substitution" – Kent Jul 31 '18 at 09:06
  • @kit I don't have mac, so I wrote in my answer gnu sed. with bsd sed, you may need `-E` however I am not 100% sure since I cannot test. – Kent Jul 31 '18 at 09:07
  • You need `sed -i '' -f sed.scr *.txt` on Mac OS; notice in particular the mandatory empty argument to `-i`. Saving to a temporary file `sed.scr` is unnecessary, though. – tripleee Jul 31 '18 at 09:20
  • That will fail when the old or new strings contain `@`s (try including an email address as old string or new string) and when the new string contains `&` or `\`. It also relies on bash for process substitution so you should say that. It also relies on GNU sed for `-r` and that particular `-i` syntax (which you did mention) but you could tweak that syntax to also work with OSX sed or just escape the brackets so you don't need EREs and remove the `-i` and -r/-E and then it'd behave the same in any sed. – Ed Morton Jul 31 '18 at 13:27
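
Pulling the comments on this answer together, the following is one possible sketch of the "generate a sed script" idea. It is not the answerer's code. Note that in the one-liner above the trailing `g` belongs to the outer substitution, so the generated commands carry no `g` flag and replace only the first match on each line; the sketch places `g` inside the generated command instead. It escapes `\`, `&` and `@` in the replacement column, and assumes that no regexp contains `@` or a space. The names `$tmp`, `listfile`, `demo.txt` and `replace.sed` are hypothetical, and the result is written to standard output rather than edited in place, because the `-i` syntax differs between GNU and BSD sed.

```shell
tmp=$(mktemp -d)   # scratch directory so no existing files are touched

# Sample mapping file and text file from the question.
printf '%s\n' 'Tom Thompson' 'Billy Bill&Ted' 'goog1e\.com google.com' \
    'https?://www\.google\.com https://google.com' > "$tmp/listfile"
printf '%s\n' 'Tom and Billy are visiting http://www.goog1e.com' > "$tmp/demo.txt"

# Emit one "s@regexp@replacement@g" command per mapping line.
# read -r keeps backslashes; "old" is the first field, "new" is the rest.
while read -r old new; do
    # Backslash-escape \, & and the @ delimiter so the replacement is literal.
    esc=$(printf '%s\n' "$new" | sed 's/[\\&@]/\\&/g')
    printf 's@%s@%s@g\n' "$old" "$esc"
done < "$tmp/listfile" > "$tmp/replace.sed"

# -E selects extended regular expressions in both GNU and BSD sed.
result=$(sed -E -f "$tmp/replace.sed" "$tmp/demo.txt")
printf '%s\n' "$result"
```

With the sample data this prints `Thompson and Bill&Ted are visiting https://google.com`, and `cat "$tmp/replace.sed"` shows the generated commands.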

Given what you've told us so far, in the question and in the comments, and considering all of the strings I can think of that aren't in your example but could occur, it sounds like this is what you need. Strings that contain spaces are excluded: to handle those you'd have to tell us how to separate old from new in mapfile.

$ cat mapfile
Tom Thompson
Billy Bill&Ted
goog1e\.com google.com
https?://www\.google\.com https://google.com

$ cat textfile
Tom and Billy are visiting http://www.goog1e.com

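$ awk '
# First file (mapfile): store each regexp and its replacement in input order.
NR==FNR {
    old[NR] = $1
    gsub(/&/,RS,$2)      # hide each & behind a newline placeholder
    new[NR] = $2
    next
}
# Remaining files: apply every mapping in order, then restore the &s.
{
    for (i=1; i in old; i++) {
        gsub(old[i],new[i])
    }
    gsub(RS,"\\&")
    print
}
' mapfile textfile
Thompson and Bill&Ted are visiting https://google.com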

The above treats the "old string" as a regexp, treats the "new string" as a literal string with no backreferences and applies the replacements strictly in the order defined in your input file.

The first gsub() converts every & in the replacement string to a Record Separator (which cannot be present since we're operating WITHIN a Record) so that the 2nd gsub() will not treat &s in the new string like a backreference, and then the 3rd gsub() just puts the RSs back to &s.
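
The placeholder trick described above can be tried in isolation. This is a small sketch, assuming the default RS (a newline) and a hard-coded replacement; the variable names `naive`, `literal` and `new` are chosen for this illustration.

```shell
# Without protection, & in the gsub() replacement expands to the matched text.
naive=$(printf 'Billy\n' | awk '{ gsub(/Billy/, "Bill&Ted"); print }')
printf '%s\n' "$naive"     # BillBillyTed

# With the placeholder: hide & behind RS, substitute, then restore it.
literal=$(printf 'Billy\n' | awk -v new='Bill&Ted' '
BEGIN { gsub(/&/, RS, new) }   # new is now "Bill", a newline, then "Ted"
{
    gsub(/Billy/, new)         # no & left to be treated as the matched text
    gsub(RS, "\\&")            # turn the placeholder back into a literal &
    print
}')
printf '%s\n' "$literal"   # Bill&Ted
```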

The above will work using any awk in any shell on any UNIX system.

Ed Morton