3

I would like to do a 'search and replace' on about 100 data files with unstructured text. Each file is about 50 MB in size and 50k lines long. There are about 20k search-and-replace terms, stored in another file, 'terms_list.csv': a CSV-format file with three columns COL1, COL2, and COL3. I need to search each of the 100 data files for the words in COL1 and COL2 of 'terms_list.csv' and, if either word is found, replace it with the corresponding word in COL3.

With my basic knowledge of shell scripting, I wrote the following shell script using AWK/sed in a loop. It reads the 20k lines of 'terms_list.csv' one by one, searches for COL1 and COL2 in each of the 100 files, and, if found, replaces them with COL3.

for DATA_FILE in text_data_file_*.csv #Data files (100 files) with 50k lines; contain terms in COL1 and COL2 of terms_list.csv
do
   while read -r line;  
       do
           x=$(echo $line | awk -F',' '{print $1}'); \
           y=$(echo $line | awk -F',' '{print $2}'); \
           z=$(echo $line | awk -F',' '{print $3}'); \
           echo "File: " $DATA_FILE " x: "$x "|" "y: "$y "|" "z: "$z ; \
            sed -i "s/$x/$z/g;s/$y/$z/g" "$DATA_FILE"
       done < terms_list.csv #20k lines in CSV format; each with search terms COL1,COL2, and replace term COL3
done

I am positive that there is better/more efficient code than the above to accomplish this task, as it requires a lot of disk reads/writes. Any suggestions for improvement? If there are better tools (perl/python) for this task, could you please give me some suggestions/directions to look at?

Below is sample data for both of the files:

  1. 'text_data_file_0001.csv': one of the 100 data files, containing unstructured data as below, with 'TermFull' and 'TermAbbreviated' among the text. [each file is about 50 MB and 50k lines]

    ID000001,Mangifera indica, commonly known as mango, is a species of flowering plant in the sumac and poison ivy family Anacardiaceae. M. indica is a popular fruit in India. 
    ID000002,Oryza sativa, commonly known as Asian rice, is the plant species most commonly referred to in English as rice. O. sativa contains two major subspecies: the sticky, short-grained japonica or sinica variety, and the nonsticky, long-grained indica rice variety.
    
  2. 'terms_list.csv' file: the search terms 'TermFull' and 'TermAbbreviated' and the replacement term 'TermJoined' are stored in 'terms_list.csv', which contains 20k lines as below:

    TermFull,TermAbbreviated,TermJoined
    Mangifera indica,M. indica,Mangiferaindica
    Oryza sativa,O. sativa,Oryzasativa
    
  3. The required output file 'text_data_file_0001.csv' is as below, with 'TermFull' and 'TermAbbreviated' replaced by 'TermJoined':

    ID000001,Mangiferaindica, commonly known as mango, is a species of flowering plant in the sumac and poison ivy family Anacardiaceae. Mangiferaindica is a popular fruit in India. 
    ID000002,Oryzasativa, commonly known as Asian rice, is the plant species most commonly referred to in English as rice. Oryzasativa contains two major subspecies: the sticky, short-grained japonica or sinica variety, and the nonsticky, long-grained indica rice variety.
    
Insilico
  • this question is too broad as it is, see [mcve] – oguz ismail May 15 '19 at 19:41
  • Whenever you're running a `while read` loop (or, really, any other loop) in shell, if you care about efficiency and are iterating over lots of items or lines, you **never** want to either fork or use any external tool inside that loop. That means no `$(...)`, no `awk`, no pipelines, no `sed`. (Running a single copy of `awk` that loops over all your content, as in Ed's answer, is great; running a separate copy of `awk` *per line of content* is horrible; a separate copy per column in that line is even worse). – Charles Duffy May 15 '19 at 19:46
  • BTW, `while IFS=, read -r x y z _; do` would avoid the need for those three `awk` invocations, though it would still leave you with a `sed -i` per line, which is still a serious inefficiency. – Charles Duffy May 15 '19 at 19:49
  • Thanks @CharlesDuffy. I am learning a lot reading and trying to digest your recommendations and Ed's solution. I need to work on my rudimentary scripting skills. – Insilico May 15 '19 at 20:00
  • As another aside, `echo "File: $DATA_FILE x: $x | y: $y | z: $z"` -- passing everything to `echo` as a single string -- is actually less buggy than the current code; see [BashPitfalls #14](http://mywiki.wooledge.org/BashPitfalls#echo_.24foo) or the http://shellcheck.net/ wiki page [SC2086](https://github.com/koalaman/shellcheck/wiki/SC2086) (btw, running your code through shellcheck before asking SO questions is always a good idea). – Charles Duffy May 15 '19 at 20:04
  • Thank you Charles for the comment and shellcheck.net site suggestion. Looks nice. I have bookmarked. – Insilico May 15 '19 at 20:13
  • The desired output contains `#most` on the second line (in `the plant species #most commonly referred to`), is that a typo? The rules don't seem to insert any `#` anywhere. – Benjamin W. May 15 '19 at 22:12
  • @BenjaminW., yes, # was a typo. I have updated the post. Thanks for noting! – Insilico May 24 '19 at 13:47
  • I recommend you post your solution as an answer instead of adding it to the question. – Benjamin W. May 24 '19 at 13:56
  • @oguzismail, I respectfully disagree. As you can see from the comments of EdMorton, CharlesDuffy, and BenjaminW., much useful information has emerged for SO users like me. – Insilico May 24 '19 at 13:57
  • @BenjaminW. I have added as answer. Still learning the SO style and etiquette :) – Insilico May 24 '19 at 14:04
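For reference, the `IFS=,` rewrite Charles Duffy describes in the comments would look roughly like this. This is only a sketch on tiny stand-in files (not the real data), and the per-line `sed -i` is kept to mirror the original loop, so it remains the main bottleneck:

```shell
# A rough sketch of the comments above: let `read` itself split each CSV
# line (IFS=,) instead of spawning three awk processes per line. The
# per-line `sed -i` mirrors the original loop and remains the bottleneck.
# Demonstrated on tiny stand-in files in a temp dir.
cd "$(mktemp -d)"
printf '%s\n' 'TermFull,TermAbbreviated,TermJoined' \
    'Mangifera indica,M. indica,Mangiferaindica' > terms_list.csv
printf '%s\n' 'ID1,Mangifera indica is nice' > text_data_file_0001.csv

for DATA_FILE in text_data_file_*.csv; do
    while IFS=, read -r x y z _; do
        sed -i "s/$x/$z/g;s/$y/$z/g" "$DATA_FILE"
    done < terms_list.csv
done
cat text_data_file_0001.csv
# ID1,Mangiferaindica is nice
```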

3 Answers

5

Here's the complete approach (i.e. no surrounding shell loop or anything else required), using GNU awk for "inplace" editing:

awk -i inplace -F, '
NR==FNR { if (NR>1) { map[$1]=$3; map[$2]=$3 } print; next }
{
    for (term in map) {
        gsub(term,map[term])
    }
    print
}
' terms_list.csv text_data_file_*.csv

Massage to suit. It looks like your terms_list file might contain RE metachars, for example, so you should think about whether you want them treated as regexp metacharacters, as you were doing in sed and as we do above with gsub(), or as literal strings, using index() and substr() instead of gsub(). Also think about how you want to handle partial matches, and cases where the first replacement creates a term that didn't previously exist and is then matched by a later replacement, and so on.
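For illustration, a literal-string replacement along those lines might look like the sketch below; `lit_gsub` is a hypothetical helper, not part of the script above:

```shell
# A sketch of the literal-string alternative: lit_gsub() is a hypothetical
# helper that replaces fixed strings via index()/substr(), so characters
# like "$" and "." are never treated as regexp metacharacters.
printf '%s\n' 'costs $5.00 (approx).' |
awk '
function lit_gsub(term, repl,    out, i) {
    out = ""
    while (i = index($0, term)) {            # find next literal occurrence
        out = out substr($0, 1, i - 1) repl  # copy prefix + replacement
        $0 = substr($0, i + length(term))    # continue after the match
    }
    $0 = out $0
}
{ lit_gsub("$5.00", "$4.50"); print }
'
# costs $4.50 (approx).
```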

I suspect something like this (untested) would be both flexible and robust enough for your needs. It will certainly be far more robust than the sed script you were running, and literally orders of magnitude faster than your shell loop + sed approach:

awk -i inplace -F, '
NR==FNR {
    orig = $0

    if (NR > 1) {
        gsub(/[^^]/,"[&]",$1)
        gsub(/\^/,"\\^",$1)

        gsub(/[^^]/,"[&]",$2)
        gsub(/\^/,"\\^",$2)

        gsub(/&/,"\\&",$3)

        map["\\<"$1"\\>"] = $3
        map["\\<"$2"\\>"] = $3
    }

    print orig
    next
}
{
    for (term in map) {
        gsub(term,map[term])
    }
    print
}
' terms_list.csv text_data_file_*.csv

The gsubs are escaping any metachars so that every char in the original and replacement text gets treated literally, and then we add word boundaries around the whole of every original term to avoid partial matches.
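The escaping idiom can be tried in isolation. Here is a minimal sketch with one hard-coded term (word boundaries omitted), showing that after escaping, the `.` in `M. indica` no longer matches arbitrary characters:

```shell
# A minimal sketch of the escaping idiom with one hard-coded term and no
# word boundaries: bracket-wrapping each character makes the "." literal,
# so "Mx indica" is no longer a false match.
printf '%s\n' 'see M. indica here' 'Mx indica is untouched' |
awk '
BEGIN {
    term = "M. indica"
    gsub(/[^^]/, "[&]", term)   # wrap every char except ^ in brackets: [M][.][ ]...
    gsub(/\^/, "\\^", term)     # a ^ cannot be bracket-wrapped, so backslash it
}
{ gsub(term, "Mangiferaindica"); print }
'
# see Mangiferaindica here
# Mx indica is untouched
```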

Ed Morton
  • Thank you Ed Morton for your expert guidance. I am trying to understand the code and run it on my data. As I said to Charles above, I need to work on my rudimentary scripting skills. – Insilico May 15 '19 at 20:04
  • You're welcome, feel free to ask questions (after a glance at the awk man page of course :-) ). The gsub()s at the start are explained where I set escOld and escNew in my answer at https://stackoverflow.com/a/29626460/1745001 (there it's sed but it's the same concept of turning RE metachars into literal chars). – Ed Morton May 15 '19 at 20:04
1

You could use sed to build a sed script from terms_list.csv:

sed '1d;s/,/|/;s|,|/|;s|.*|s/&/g|' terms_list.csv

Which works as follows:

1d           # Skip the first line
s/,/|/       # Replace the first comma with a pipe
s|,|/|       # Replace the second comma with a slash
s|.*|s/&/g|  # Wrap each line in s/ and /g

and has this output:

$ sed '1d;s/,/|/;s|,|/|;s|.*|s/&/g|' terms_list.csv
s/Mangifera indica|M. indica/Mangiferaindica/g
s/Oryza sativa|O. sativa/Oryzasativa/g

Now, we use this output to run sed -i (requires GNU sed for in-place editing) on all the files we want to change:

sed '1d;s/,/|/;s|,|/|;s|.*|s/&/g|' terms_list.csv | sed -i -Ef- text_data_file_*.csv
  • -E enables extended regular expressions so we can use | for alternation
  • -f- reads the sed commands from standard input
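The two commands can be tried end-to-end on the question's sample terms (temporary files below; `-i` dropped so the result prints to stdout instead of editing in place):

```shell
# End-to-end check on the question's sample terms (temporary files;
# -i dropped so the result prints to stdout instead of editing in place).
cd "$(mktemp -d)"
cat > terms_list.csv <<'EOF'
TermFull,TermAbbreviated,TermJoined
Mangifera indica,M. indica,Mangiferaindica
Oryza sativa,O. sativa,Oryzasativa
EOF
cat > text_data_file_0001.csv <<'EOF'
ID000001,Mangifera indica is a mango. M. indica is popular in India.
EOF
sed '1d;s/,/|/;s|,|/|;s|.*|s/&/g|' terms_list.csv |
    sed -Ef- text_data_file_0001.csv
# ID000001,Mangiferaindica is a mango. Mangiferaindica is popular in India.
```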

The first command could be made a bit more robust in terms of explicit word boundaries to avoid substring matches:

$ sed '1d;s/,/|/;s|,|)\\b/|;s|.*|s/\\b(&/g|' terms_list.csv
s/\b(Mangifera indica|M. indica)\b/Mangiferaindica/g
s/\b(Oryza sativa|O. sativa)\b/Oryzasativa/g

where \b denotes a word boundary (also a GNU sed extension).


If the input contains any regex metacharacters, we have to escape them all, so the first command would become something like this:

sed '1d;s/[][*+{}()/\|&^$.?]/\\&/g;s/,/|/;s|,|)\\b/|;s|.*|s/\\b(&/g|' terms_list.csv

The important addition here is the first substitution, which takes care of metacharacters by escaping them with a backslash:

s/[][*+{}()/\|&^$.?]/\\&/g

So for a worst case scenario where terms_list.csv contains something like

a[abc]*x+\1{2}|-(o).^$?/\a,other,abc&\1def

the generated command would look like

s/\b(a\[abc\]\*x\+\\1\{2\}\|-\(o\)\.\^\$\?\/\\a|other)\b/abc\&\\1def/g

There is a case that still isn't covered by this: if the input CSV file contains commas within one of the fields, you have to resort to a CSV parser for pre-processing, since commas are left untouched when escaping regex metacharacters in the third command above.
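The worst-case line can be fed through the generator to confirm the escaping really produces the command shown above:

```shell
# Feed the pathological terms line through the generator and inspect
# the emitted sed command.
printf '%s\n' 'TermFull,TermAbbreviated,TermJoined' \
    'a[abc]*x+\1{2}|-(o).^$?/\a,other,abc&\1def' |
    sed '1d;s/[][*+{}()/\|&^$.?]/\\&/g;s/,/|/;s|,|)\\b/|;s|.*|s/\\b(&/g|'
# s/\b(a\[abc\]\*x\+\\1\{2\}\|-\(o\)\.\^\$\?\/\\a|other)\b/abc\&\\1def/g
```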

Benjamin W.
  • Make sure to escape any regexp metachars and backreferences too, see https://stackoverflow.com/q/29613304/1745001. – Ed Morton May 15 '19 at 23:17
  • @EdMorton You just want me to admit that sed is in over its head with such a problem ;) I admit it! Maybe a disclaimer about what would break this is in order... – Benjamin W. May 16 '19 at 13:28
  • I actually just want the OP to be comparing apples when she looks at the scripts, and to have something that works somewhat robustly in both cases. She definitely has RE metachars in her terms_list.csv, as we can see `.`s in there, and I'm sure she'd want them treated literally; I'm also fairly sure she'd have no idea they weren't being treated literally by a script if it wasn't pointed out :-)! – Ed Morton May 16 '19 at 13:34
  • @EdMorton your point was spot on. However, I am curious: how did you arrive at the conclusion that I am female or male? ;) – Insilico May 24 '19 at 14:13
  • I haven't arrived at any conclusion about your gender, I just had to use **some** gender pronoun as `it` seems a tad impersonal :-)! – Ed Morton May 24 '19 at 15:02
  • @EdMorton I like "they" if I don't know or want to assume. – Benjamin W. May 24 '19 at 15:06
  • I prefer defaulting to `she` because too often we assume male for many roles and IMHO it's good to shake that assumption whenever possible. I guarantee if I had said `he` in my post I'd never have heard about it if the OP were a woman, but frequently when I use `she` and the OP is a man I get "why did you think I'm a woman?". Gotta break those stereotypes :-). – Ed Morton May 24 '19 at 15:07
  • @EdMorton See also books by Michael W. Lucas, or the Filfre blog :) – Benjamin W. May 24 '19 at 15:18
  • @EdMorton I use 'OP' all the time. I would be curious in either case, whether you used 'he' or 'she'. But you've got a good point. – Insilico May 24 '19 at 17:47
0

UPDATE on 2019-05-24:

After a bit of trial and error, I liked the approach by BenjaminW. However, I also adapted the approach for perl and found that perl (v5.22.1) outperformed sed (GNU sed 4.2.2) in this case. (With very similar code, the only difference being a ';' needed at the end of each pattern, perl was dramatically faster than sed; please see the time command output in the script below. The file outputs had similar word counts.) I have posted the sed and perl code below, which worked for my current need.

#!/bin/bash
##################################################
##terms_list.csv content: About 20k lines
#TermFull,TermAbbreviated,TermJoined
#Mangifera indica,M. indica,Mangiferaindica
#Oryza sativa,O. sativa,Oryzasativa

## SCRIPT 1: sed
substitution_with_sed(){
    #First generate the substitution pattern script
    sed '1d;s/[][*+{}()/\|&^$.?]/\\&/g;s/,/|/;s|,|)\\b/|;s|.*|s/\\b(&/g|' terms_list.csv > terms_list.sed
    #1d           # Skip the first line, which is the CSV header
    #s/[][*+{}()/\|&^$.?]/\\&/g # Escape any regex metacharacters with a backslash
    #s/,/|/       # Replace the first comma with a pipe
    #s|,|)\\b/|   # Replace the second comma with ")\b/"
    #s|.*|s/\\b(&/g| # Wrap each line in "s/\b(" and "/g"
    ##'terms_list.sed' file content
    #s/\b(Mangifera indica|M. indica)\b/Mangiferaindica/g
    #s/\b(Oryza sativa|O. sativa)\b/Oryzasativa/g

    for DATA_FILE in ./DIR/DATA_CSV_FILE*.csv; # About 100 DATA_CSV_FILE*.csv files
    do
        FILE="$(basename "$DATA_FILE")"
        echo "Running SED on $DATA_FILE and $FILE"
        echo "sed -E -f terms_list.sed < $DATA_FILE > sed-out-$FILE"
        time sed -E -f terms_list.sed < "$DATA_FILE" > "sed-out-$FILE"
        #-E enables extended regular expressions so we can use | for alternation
        #-f reads the sed commands from the terms_list.sed file
        # # real    25m55.369s
        # # user    25m54.976s
        # # sys     0m0.336s
    done
}

## SCRIPT 2: perl
substitution_with_perl(){

#First generate the substitution script
sed '1d;s/[][*+{}()/\|&^$.?]/\\&/g;s/,/|/;s|,|)\\b/|;s|.*|s/\\b(&/g;|' terms_list.csv > terms_list.perl
    ##'terms_list.perl' file content
    #s/\b(Mangifera indica|M. indica)\b/Mangiferaindica/g;
    #s/\b(Oryza sativa|O. sativa)\b/Oryzasativa/g;

    for DATA_FILE in ./DIR/DATA_CSV_FILE*.csv;
    do
        FILE="$(basename "$DATA_FILE")"
        echo "Running PERL on $DATA_FILE and $FILE"
        echo "perl -p terms_list.perl < $DATA_FILE > perl-out-$FILE"
        time perl -p terms_list.perl < "$DATA_FILE" > "perl-out-$FILE"
        ## -p runs the s/// statements in terms_list.perl inside an implicit read-print loop over the input
        # # real    0m8.120s
        # # user    0m8.072s
        # # sys     0m0.044s
    done
}
#####################################################

##Call functions
substitution_with_sed
substitution_with_perl

#Raw data
#ID000001,Mangifera indica, commonly known as mango, is a species of flowering plant in the sumac and poison ivy family Anacardiaceae. M. indica is a popular fruit in India. 
#ID000002,Oryza sativa, commonly known as Asian rice, is the plant species most commonly referred to in English as rice. O. sativa contains two major subspecies: the sticky, short-grained japonica or sinica variety, and the nonsticky, long-grained indica rice variety.

#Desired processed output data in 'sed-out-$FILE'/'perl-out-$FILE' file content
#ID000001,Mangiferaindica, commonly known as mango, is a species of flowering plant in the sumac and poison ivy family Anacardiaceae. Mangiferaindica is a popular fruit in India. 
#ID000002,Oryzasativa, commonly known as Asian rice, is the plant species most commonly referred to in English as rice. Oryzasativa contains two major subspecies: the sticky, short-grained japonica or sinica variety, and the nonsticky, long-grained indica rice variety.

@EdMorton, @CharlesDuffy, @BenjaminW, thanks again for your comments and solutions. The info you provided was immensely useful to me, and I have learned a lot in the past week. I have taken your suggestions and summarized/documented them below for naive coders like me.

  1. Thanks @EdMorton: be careful about the metacharacters in the substitution pattern! I have '.' in my data, and it matches any character in a RegExp; it needs to be escaped with a backslash.

  2. Thanks to @CharlesDuffy: when iterating over lots of items or lines with any loop in shell, using any external tool inside that loop considerably decreases efficiency. The new code above is highly efficient compared to my original loop.

  3. Thanks to @CharlesDuffy: while IFS=, read -r x y z _; do will read a CSV line and assign the fields to variables.

  4. Thanks to @CharlesDuffy: echo "File: $DATA_FILE x: $x | y: $y | z: $z", passing everything to echo as a single string, is better than what I had in my initial code above.

I am also curious about a python solution and will update when I have working code and some benchmarks.

Insilico
  • @EdMorton I accepted Benjamin's answer first, but I think I misunderstood his comment about adding it as a new answer. Anyway, I have marked his post as the accepted answer. BTW, this is my first question and I am not familiar with the etiquette; I should probably pay more attention to reading the SO guidelines (if available). – Insilico May 24 '19 at 17:52
  • No no, I meant that if you have your own, separate solution (like the Perl one), it's better to add it as an answer instead of making it part of the question. The question shouldn't contain solutions. – Benjamin W. May 24 '19 at 18:00