I have a large dataset file with two columns, like
AS জীৱবিজ্ঞানবিভাগ
AS চেতনাদাস
AS বৈকল্পিক
and I want to run my command on the second column, store the result and get the output with the same column formatting:
AS jibvigyanvibhag
AS chetanadas
AS baikalpik
where my command is this pipe:
echo "$0" | indictrans -s asm -t eng --ml --build-lookup
So I'm doing this:
awk -v OFS="\t" '{ print "echo "$2" | indictrans -s asm -t eng --ml --build-lookup" | "/bin/sh"}' in.txt > out.txt
but this does not preserve the columns; it prints only the transliterated second column, like this:
jibvigyanvibhag
chetanadas
baikalpik
My solution was the following
awk -v OFS="\t" '{ "echo "$2" | indictrans -s asm -t eng --ml --build-lookup" | getline RES; print $1,$2,RES}' in.txt > out.txt
that will print out
AS জীৱবিজ্ঞানবিভাগ jibvigyanvibhag
AS চেতনাদাস chetanadas
AS বৈকল্পিক baikalpik
Now I want to parametrize the command, but the escaping looks odd here:
"echo "$0" | indictrans -s $SOURCE -t $TARGET --ml --build-lookup"
and it does not work. How do I correctly execute this command and escape the parameters?
[UPDATE] This is a partial solution I came up with, inspired by the suggested one:
#!/bin/bash
SOURCE=asm
TARGET=eng
IN=$2
OUT=$3
awk -v OFS="\t" '{
    CMD = "echo "$2" | indictrans -s asm -t eng --ml --build-lookup"
    CMD | getline RES
    print $1,RES
    # close the pipe so that a new process is started for the next row
    close(CMD)
}' "$IN" > "$OUT"
I still cannot get rid of the hardcoded values: it seems that I cannot define them with -v as usual, like
awk -v OFS="\t" -v source=$SOURCE -v target=$TARGET '{
CMD = "echo "$2" | indictrans -s source -t target --ml --build-lookup"
...
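For what it's worth, here is a minimal sketch of one way this might be parametrized. awk does not expand variable names inside a string literal, so source and target have to sit outside the quotes and be concatenated into the command string. TOOL and the transliterate function are hypothetical names introduced only for this sketch (TOOL defaults to indictrans), and the field is single-quoted before it reaches /bin/sh so that spaces or quotes in it are not interpreted by the shell.

```shell
#!/bin/bash
SOURCE=asm
TARGET=eng
# TOOL is a hypothetical variable for this sketch; it defaults to indictrans.
TOOL=${TOOL:-indictrans}

# usage: transliterate in.txt > out.txt
transliterate() {
    awk -v OFS="\t" -v q="'" -v tool="$TOOL" \
        -v source="$SOURCE" -v target="$TARGET" '
    {
        # Wrap the field in single quotes for the shell, escaping embedded ones.
        arg = $2
        gsub(q, q "\\" q q, arg)
        # source and target are awk variables, so they are concatenated,
        # not written inside the string literal.
        cmd = "printf \"%s\\n\" " q arg q " | " tool " -s " source " -t " target " --ml --build-lookup"
        res = ""
        cmd | getline res
        close(cmd)
        print $1, res
    }' "$1"
}
```

This still starts one process per row, so it only addresses the quoting and the variables, not the speed.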
NOTES.
The indictrans process reads from stdin and writes to stdout in this way:
for line in ifp:
    tline = trn.convert(line)
    ofp.write(tline)
# close files
ifp.close()
ofp.close()
where
ifp = codecs.getreader('utf8')(sys.stdin)
ofp = codecs.getwriter('utf8')(sys.stdout)
so it takes one line from stdin, processes the data with some library call trn.convert, and writes the result to stdout, without any parallelism.
For this reason (lack of parallelism across input lines), performance is bound by the size of the dataset (number of rows).
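Since the tool reads stdin line by line, one alternative to starting it once per row might be to start it a single time and stream the whole second column through it. The sketch below assumes the columns are tab-separated and that the tool emits exactly one output line per input line (which the loop above suggests, but I have not verified it for every input). TOOL and transliterate_batch are hypothetical names introduced for this sketch.

```shell
#!/bin/bash
SOURCE=asm
TARGET=eng
# TOOL is a hypothetical variable for this sketch; it defaults to indictrans.
TOOL=${TOOL:-indictrans}

# usage: transliterate_batch in.txt > out.txt
transliterate_batch() {
    # Keep the first column aside in a temporary file.
    col1=$(mktemp) || return 1
    cut -f1 "$1" > "$col1"
    # Stream column 2 through a single tool process, then glue the
    # results back next to column 1 ("-" makes paste read stdin).
    cut -f2 "$1" \
        | "$TOOL" -s "$SOURCE" -t "$TARGET" --ml --build-lookup \
        | paste "$col1" -
    rm -f "$col1"
}
```

If the tool ever drops or merges a line, the two columns would go out of step, so it is probably worth comparing the line counts of input and output.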
An example two-column input dataset (1K rows) is available here. A sample from it:
KN ಐಕ್ಯತೆ ಕ್ಷೇಮಾಭಿವೃದ್ಧಿ ಸಂಸ್ಥೆ ವಿಜಯಪುರ
KN ಹೊರಗಿನ ಸಂಪರ್ಕಗಳು
KN ಮಕ್ಕಳ ಸಾಹಿತ್ಯ ಮತ್ತು ಸಾಂಸ್ಖ್ರುತಿಕ ಕ್ಷೇತ್ರದಲ್ಲಿ ಸೇವೆ ಸಲ್ಲಿಸುತ್ತಿರುವ ಸಂಸ್ಠೆ ಮಕ್ಕಳ ಲೋಕ
while the example script based on the last accepted answer is here