I've been trying to do bulk find and replace on two text files using a csv. I've seen the questions that SO suggests, and none seem to answer my question.

I've created two variables for the two text files I want to modify. The csv has two columns and hundreds of rows. The first column contains strings (none contain whitespace) that already appear in the text files and need to be replaced with the corresponding string from the same row of the second column.

As a test, I tried the script

#!/bin/bash

test1='long_file_name.txt'
find='string1'
replace='string2'

sed -e "s/$find/$replace/g" $test1 > $test1.tmp && mv $test1.tmp $test1

This was successful, except that I need to do it once for every row in the csv, using the values given by the csv in each row. My hunch is that I've used the while loop incorrectly, but I can't find the error. When I execute the script below, I get the command-line prompt back, which makes me think that something has happened. When I check the text files, nothing's changed.

The two text files, this script, and the csv are all in the same folder (it's also been my working directory when I do this).

#!/bin/bash

textfile1='long_file_name1.txt'
textfile2='long_file_name2.txt'

while IFS=, read f1 f2
do
    sed -e "s/$f1/$f2/g" $textfile1 > $textfile1.tmp && \
         mv $textfile1.tmp $textfile1
    sed -e "s/$f1/$f2/g" $textfile2 > $textfile2.tmp && \
         mv $textfile2.tmp $textfile2
done <'findreplace.csv'

It seems to me that this code should do what I want it to do (but doesn't); perhaps I'm misunderstanding something fundamental (I'm new to bash scripting)?

The csv looks like this, but with hundreds of rows. All a_i's should be replaced with their counterpart b_i in the next column over.

a_1 b_1
a_2 b_2
a_3 b_3

Something to note: All the strings actually contain underscores, just in case this affects something. I've tried wrapping the variable name in braces a la ${var}, but it still doesn't work.

I appreciate the solutions, but I'm also curious to know why the above doesn't work. (Also, I would vote everyone up, but I lack the reputation to do so. However, know that I appreciate and am learning a lot from your answers!)
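One way to narrow this down is to print each row exactly as `read` parses it, before any `sed` is involved. The following is only a diagnostic sketch: `show_pairs` is a hypothetical helper name, and it assumes the csv is comma-separated, as the `IFS=,` in the script above expects.

```shell
#!/bin/bash

# Hypothetical diagnostic: print every row the way the while loop sees it.
# The brackets make stray spaces or a missing comma easy to spot.
show_pairs() {
    local f1 f2 n=0
    while IFS=, read -r f1 f2; do
        n=$((n + 1))
        printf 'row %d: find=[%s] replace=[%s]\n' "$n" "$f1" "$f2"
    done < "$1"
    # If this prints 0, the loop body never ran at all.
    printf 'rows read: %d\n' "$n"
}
```

Calling `show_pairs findreplace.csv` should list one bracketed pair per row. If it reports `rows read: 0`, `read` never found a complete newline-terminated line, which would explain a script that returns to the prompt without changing anything. If both strings show up inside the first pair of brackets, the file is not really comma-separated.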

suhlee
  • possible duplicate of [passing variable containing special chars to sed in bash](http://stackoverflow.com/questions/22093750/passing-variable-containing-special-chars-to-sed-in-bash) – NeronLeVelu Mar 16 '15 at 07:31
  • Can you paste a sample of the csv file you're passing? Otherwise your code should work as it is. On another note: you can use `sed -i "s/$f1/$f2/g" $textfile1` to avoid creating an intermediate .tmp file. With the `-i` option, sed will find, replace, and overwrite the file in place – sa77 Mar 16 '15 at 08:58
  • @sa77 except that `sed -i` does create an intermediate text file! In addition, it is not supported by every sed, and in many it is a syntax error. – William Pursell Mar 16 '15 at 11:21
  • It turns out there was a problem with the csv I created using TextEdit; after creating the csv again using Sublime, the script works. Thanks again, everybody! I knew about the `-i` flag, but preferred to use the solution given by the second answer [here](http://stackoverflow.com/questions/5171901/sed-command-find-and-replace-in-file-and-overwrite-file-doesnt-work-it-empties) – suhlee Mar 17 '15 at 03:23
  • For anyone who might be interested, the original code given above works fine for the task I described. – suhlee Mar 17 '15 at 03:30
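The comments do not say what exactly was wrong with the TextEdit file. If it was a line-ending problem (that is only an assumption here), something like the sketch below could check for and remove carriage returns. `count_cr` and `normalize_newlines` are hypothetical helper names, not part of the original script.

```shell
#!/bin/bash

# Hypothetical helper: print how many carriage-return bytes a file contains.
count_cr() {
    tr -cd '\r' < "$1" | wc -c | tr -d ' '
}

# Hypothetical helper: rewrite a file so that every line ends in a plain LF.
# Both CRLF and lone CR become LF; the empty lines this produces are dropped,
# so genuinely empty rows in the csv are removed as well.
normalize_newlines() {
    local file=$1
    tr '\r' '\n' < "$file" | sed '/^$/d' > "$file.tmp" &&
        mv "$file.tmp" "$file"
}
```

A non-zero result from `count_cr findreplace.csv` would suggest running `normalize_newlines findreplace.csv` before the find-and-replace loop.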

2 Answers

1

If you are going to process a lot of data and your patterns may contain special characters, I would consider using Perl, especially if findreplace.csv holds a lot of pairs. You can use the following script as a filter or for in-place modification of many files. As a side effect, it loads the replacements and builds the Aho-Corasick automaton only once per invocation, which makes this solution pretty efficient (O(M+N) instead of O(M*N) in your solution).

#!/usr/bin/perl
use strict;
use warnings;
use autodie;

my $in_place = ( @ARGV and $ARGV[0] =~ /^-i(.*)/ )
    ? do {
    shift;
    my $backup_extension = $1;
    my $backup_name      = $backup_extension =~ /\*/
        ? sub { ( my $fn = $backup_extension ) =~ s/\*/$_[0]/; $fn }
        : sub { shift . $backup_extension };
    my $oldargv = '-';
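    sub {
        if ( $ARGV ne $oldargv ) {
            if ( length $backup_extension ) {
                rename( $ARGV, $backup_name->($ARGV) );
            }
            else {
                # Plain -i: unlink first so the open below creates a new
                # file instead of truncating the one still being read.
                unlink($ARGV);
            }
            open( ARGVOUT, '>', $ARGV );
            select(ARGVOUT);
            $oldargv = $ARGV;
        }
    };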
    }
    : sub { };

die "$0: File with replacements required." unless @ARGV;
my ( $re, %replace );
do {
    my $filename = shift;
    open my $fh, '<', $filename;
    %replace = map { chomp; split ',', $_, 2 } <$fh>;
    close $fh;
    $re = join '|', map quotemeta, keys %replace;
    $re = qr/($re)/;
};

while (<>) {
    $in_place->();
    s/$re/$replace{$1}/g;
}
continue {print}

Usage:

./replace.pl replace.csv <file.in >file.out

as well as

./replace.pl replace.csv file.in >file.out

or in-place

./replace.pl -i replace.csv file1.csv file2.csv file3.csv

or with backup

./replace.pl -i.orig replace.csv file1.csv file2.csv file3.csv

or with a backup name built from a placeholder

./replace.pl -ithere.is.\*.original replace.csv file1.csv file2.csv file3.csv
Hynek -Pichi- Vychodil
  • Thank you for the code! I also just want to know what I'm doing wrong, but your answer was super useful. Will upvote when I get enough reputation. – suhlee Mar 16 '15 at 22:33
0

You should convert your CSV file to a sed.script with the following command:

awk -F, '{print "s/" $1 "/" $2 "/g"}' replace.csv > sed.script

And then you will be able to do a one pass replacement:

sed -i -f sed.script longfilename.txt

This will be a faster implementation of what you want to do.

BTW, sorry, but I do not see what is wrong with your script; it should work unless your CSV file has more than two columns.
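If the strings ever contain characters that are special to sed (such as `.`, `*`, `/` or `&`), the generated script needs them escaped. Here is a minimal sketch of that idea in plain shell rather than awk. The helper names `escape_pattern`, `escape_replacement` and `csv_to_sed_script` are made up for illustration, and it assumes basic (not extended) regular expressions and a two-column, comma-separated file.

```shell
#!/bin/bash

# Hypothetical helper: escape a literal string for the pattern side of s/…/…/
# (basic regular expressions, with / as the delimiter).
escape_pattern() {
    printf '%s' "$1" | sed 's|[][\.*^$/]|\\&|g'
}

# Hypothetical helper: escape a literal string for the replacement side,
# where backslash, & and the delimiter / are special.
escape_replacement() {
    printf '%s' "$1" | sed 's|[\&/]|\\&|g'
}

# Hypothetical helper: turn a two-column csv into one sed command per row.
csv_to_sed_script() {
    local find replace
    while IFS=, read -r find replace; do
        # Skip rows without a search string (e.g. empty lines).
        [ -n "$find" ] || continue
        printf 's/%s/%s/g\n' \
            "$(escape_pattern "$find")" "$(escape_replacement "$replace")"
    done < "$1"
}
```

Usage would then look like `csv_to_sed_script replace.csv > sed.script`, followed by the same `sed -f sed.script` call as above. Strings made only of letters, digits and underscores pass through unchanged.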

Adam
  • I've heard of awk (and gawk), but have never used it. Looking at sed.script in a text editor, it only contains one line; going by the csv format in the question, the line is `s/a_1/b_1 a_2/g`. `a_i` and `b_i` actually do contain underscores; could that be why? (Amended original post.) – suhlee Mar 16 '15 at 22:35
  • Also thanks for confirming the validity of my script! – suhlee Mar 17 '15 at 04:38
  • You should have multiple lines in your result files. You can add \n after /g to be sure of it... – Adam Mar 17 '15 at 06:41
  • It actually ended up being an issue with the csv, though your code is definitely handy (and works). I'll definitely try awk the next time I need to do a task like this. Thanks again! – suhlee Mar 18 '15 at 03:23
  • Thanks for the feedback. You might accept it as an answer or give it a +1 if you want my answer to benefit a next viewer. Thanks. – Adam Mar 18 '15 at 06:47