I have a CSV file with multiple before/after values which I am using to find and replace values in another large data file (~200MB).

I initially used a loop reading in each before/after value and sed to implement the find and replace.

The issue is that it's understandably a bit slow, so I wanted to try running all of the find/replace in a single line separated by semi-colons to see if it might be faster by only having to traverse the target data file one time.

So I have two values:

find="ABC"
replace="DEF"

Then I initialized the variable:

cmd=""

and within the loop, I tried to use this command:

cmd="${cmd}s/${find}/${replace}/g;"

The idea is to have everything concatenate into one long string like so:

"s/FIND1/REP1/g;s/FIND2/REP2/g;s/FIND3/REP3/g; ..." And so on

Then I could run the command:

perl -i -p -e ${cmd} TARGET_FILE

The issue is that my output for the cmd is looking really strange:

echo ${cmd}
/DEF/g;ABC

The order is totally messed up. I even tried setting ${cmd} to a plain string like "test" to see what was going on, and the output did not change. Somehow the variable order is getting reversed, and the leading "s" is not showing up.

I tried to use printf instead and got the same results. I tried removing the semicolon, changing the forward-slash, escaping the characters, and various other things but nothing seems to be working. Could someone tell me what is going on with this command and why the strange behavior?

Timur Shtatland
Ben C Wang
  • Your problem is poorly articulated. Where is a sample of the CSV file with `find,replace` pairs? Where is a snippet of the `data` file with at least a few lines of data? Are you trying to implement a Perl one-liner wrapped in a shell script? Or, if you want to implement this as a Perl script, it would be nice to see your effort so far: the code you have already tried. – Polar Bear Mar 04 '20 at 18:35
  • As a test you could use `perl -0777 -pe 's/$find/$replace/g' file_name`, and if you are satisfied with the result, do the replacement in place with `perl -0777 -i.bak -pe 's/$find/$replace/g' file_name`. – Polar Bear Mar 04 '20 at 18:39
  • Wrapping a Perl script in a shell loop is not very efficient: on every iteration perl has to start up, read the script, compile it, run it, and exit, and then the cycle repeats again and again. – Polar Bear Mar 04 '20 at 18:41
  • It looks like your variable has carriage return characters in it, probably because the script and/or CSV file are in DOS/Windows format (see [here](http://stackoverflow.com/questions/31885409/why-would-a-correct-shell-script-give-a-wrapped-truncated-corrupted-error-messag) and [here](https://stackoverflow.com/questions/39527571/are-shell-scripts-sensitive-to-encoding-and-line-endings)). Convert the files to unix format to avoid trouble. – Gordon Davisson Mar 04 '20 at 19:14
  • Also, double-quote variable references (e.g. `perl -i -p -e "${cmd}" TARGET_FILE` instead of `perl -i -p -e ${cmd} TARGET_FILE`) to avoid the things the shell does to unquoted variable references. I recommend [shellcheck.net](https://www.shellcheck.net) for spotting common mistakes like this. – Gordon Davisson Mar 04 '20 at 19:15
  • @GordonDavisson You were right: the file I received had been created on Windows, and I did not know that the ^M characters caused that type of behavior. I removed the ^M characters and now everything is working fine. Your answer was correct. Considering how much time I spent on this, I will never forget to double-check carriage returns. Thank you! – Ben C Wang Mar 04 '20 at 19:47
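
To make the carriage-return diagnosis above concrete, here is a minimal shell sketch. It assumes the before/after pairs arrive in a CSV with Windows (CRLF) line endings; the file names `pats_dos.csv` and `pats_unix.csv` and the sample values are made up for illustration. It exposes the hidden `\r` in the assembled command with `od -c`, then rebuilds the command after stripping carriage returns with `tr -d '\r'`.

```shell
#!/bin/sh
# Sketch only: file names and sample values are hypothetical.
workdir=$(mktemp -d)

# Simulate a CSV saved on Windows: every line ends in \r\n.
printf 'FIND1,REP1\r\nFIND2,REP2\r\n' > "$workdir/pats_dos.csv"

# Building the command straight from that file keeps a hidden \r
# at the end of each replacement value.
bad_cmd=""
while IFS=, read -r find replace; do
    bad_cmd="${bad_cmd}s/${find}/${replace}/g;"
done < "$workdir/pats_dos.csv"

# od -c makes the carriage returns visible as \r.
printf '%s' "$bad_cmd" | od -c

# Strip the carriage returns first, then build the command again.
tr -d '\r' < "$workdir/pats_dos.csv" > "$workdir/pats_unix.csv"
cmd=""
while IFS=, read -r find replace; do
    cmd="${cmd}s/${find}/${replace}/g;"
done < "$workdir/pats_unix.csv"

printf '%s\n' "$cmd"
# The clean string can then be passed to perl as one quoted argument:
#   perl -i -p -e "$cmd" TARGET_FILE
```

When such a string is printed to a terminal, each `\r` moves the cursor back to the start of the line, so later text overwrites earlier text. That is likely why the echoed command looked reordered and seemed to lose its leading "s".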

1 Answer

Building one long string of substitutions does not scale well, and applying many separate substitutions to every line is not efficient either.

This Perl one-liner reads the CSV file with patterns and replacements ("before" and "after" values) into the hash %to. It then constructs the regex $pat by joining all the "before" values with | into a single alternation. Finally, it reads the input file, replaces each "before" value with its "after" value, and prints the result into the output file.

cat > pats.csv <<EOF
FIND1,REP1
FIND2,REP2
FIND3,REP3
EOF

cat > in.txt <<EOF
foo FIND1,FIND2,FIND1
bar FIND2 bar
FIND3
EOF

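perl -lpe '
BEGIN {
    # Read pats.csv into %to: key = "before" value, value = "after" value.
    %to = map { chomp; split m{,}, $_ }
        do { local @ARGV = q{pats.csv}; <> };
    # Join all "before" values into one alternation and compile it once.
    $pat = join q{|}, keys %to;
    $pat = qr{($pat)};
}
# For every input line, replace each match with its mapped value.
s{$pat}{$to{$1}}gxms;
' in.txt > out.txt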

cat out.txt
# Prints this:
foo REP1,REP2,REP1
bar REP2 bar
REP3

Timur Shtatland