I would like to do a ‘search and replace’ on about 100 data files with unstructured text. Each of the files are about 50 MB in size and 50k lines. There are about 20k terms to do search and replace, which are stored in another file; 'terms_list.csv', a CSV format file with three columns COL1, COL2, and COL3. I need to search for words in COL1 and COL2 of 'terms_list.csv' in each of the 100 data files and replace with the corresponding word in COL3, if either of the words are found.
With my basic knowledge of shell scripting, I wrote the following shell script using AWK/SED to loop. It reads lines one by one from 20k lines ‘terms_list.csv’ and search for COL1 and COL2 in each of the 100 files, and if found, will replace with COL3.
for DATA_FILE in $(ls text_data_file_*.csv) #Data files (100 files) with 50k lines; contain terms in COL1 and COL2 of terms_list.csv
do
while read -r line;
do
x=$(echo $line | awk -F',' '{print $1}'); \
y=$(echo $line | awk -F',' '{print $2}'); \
z=$(echo $line | awk -F',' '{print $3}'); \
echo "File: " $DATA_FILE " x: "$x "|" "y: "$y "|" "z: "$z ; \
sed -i "s/$x/$z/;s/$y/$z/g" $DATA_FILE
done < terms_list.csv #20k lines in CSV format; each with search terms COL1,COL2, and replace term COL3
done
I am positive that there is a better/efficient code than above to accomplish the task as this require a lot of disk read/write. Any suggestions to improve? If there are better tools (perl/python) for this task, could you please give me some suggestions/directions to look at?
Below are a sample data for both of the files:
‘text_data_file_0001.csv’: One of the 100 data files, ‘text_data_file_0001.csv’ contain unstructured data as below which contain ‘TermFull’ and ‘TermAbbreviated’ among the text. [size of each file is about 50 MB and 50k lines]
ID000001,Mangifera indica, commonly known as mango, is a species of flowering plant in the sumac and poison ivy family Anacardiaceae. M. indica is a popular fruit in India. ID000002,Oryza sativa, commonly known as Asian rice, is the plant species most commonly referred to in English as rice. O. sativa contains two major subspecies: the sticky, short-grained japonica or sinica variety, and the nonsticky, long-grained indica rice variety.
'terms_list.csv' file: The search terms 'TermFull' and 'TermAbbreviated', and replace term 'TermJoined' are stored in ‘terms_list.csv’ contains 20k lines as below
TermFull,TermAbbreviated,TermJoined Mangifera indica,M. indica,Mangiferaindica Oryza sativa,O. sativa,Oryzasativa
Required output file ‘text_data_file0001.csv’ is as below with ‘TermFull’ and ‘TermAbbreviated’ replaced with ‘TermJoined’
ID000001,Mangiferaindica, commonly known as mango, is a species of flowering plant in the sumac and poison ivy family Anacardiaceae. Mangiferaindica is a popular fruit in India. ID000002,Oryzasativa, commonly known as Asian rice, is the plant species #most commonly referred to in English as rice. Oryzasativa contains two major subspecies: the sticky, short-grained japonica or sinica variety, and the nonsticky, long-grained indica rice variety.