I have large dataset files with two columns, like

AS  জীৱবিজ্ঞানবিভাগ
AS  চেতনাদাস
AS  বৈকল্পিক

and I want to run my command on the second column, store the result and get the output with the same column formatting:

AS jibvigyanvibhag
AS chetanadas
AS baikalpik

where my command is this pipe:

echo "$0" | indictrans -s asm -t eng --ml --build-lookup

So I'm doing this:

awk -v OFS="\t" '{ print "echo "$2" | indictrans -s asm -t eng --ml --build-lookup" | "/bin/sh"}' in.txt > out.txt

but this does not preserve the columns; it prints only the transliterated second column, like this:

jibvigyanvibhag
chetanadas
baikalpik

My solution was the following

awk -v OFS="\t" '{ "echo "$2" | indictrans -s asm -t eng --ml --build-lookup" | getline RES; print $1,$2,RES}' in.txt > out.txt

that will print out

AS  জীৱবিজ্ঞানবিভাগ    jibvigyanvibhag
AS  চেতনাদাস    chetanadas
AS  বৈকল্পিক    baikalpik

Now I want to parametrize the command, but the escaping looks odd here:

"echo "$0" | indictrans -s $SOURCE -t $TARGET --ml --build-lookup"

and it does not work. How do I correctly execute this command and escape the parameters?

[UPDATE] This is a partial solution I came up with, inspired by the suggested one:

#!/bin/bash

SOURCE=asm
TARGET=eng
IN=$2
OUT=$3

awk -v OFS="\t" '{
        CMD = "echo "$2" | indictrans -s asm -t eng --ml --build-lookup"
        CMD | getline RES
        print $1,RES
        close(CMD)
}' $IN > $OUT

I still cannot get rid of the hard-coded values; it seems that I cannot define them with -v as usual, like

awk -v OFS="\t" -v source=$SOURCE -v target=$TARGET '{
            CMD = "echo "$2" | indictrans -s source -t target --ml --build-lookup"
...

NOTES.

The indictrans process handles the stdin and writes to stdout in this way:

    for line in ifp:
        tline = trn.convert(line)
        ofp.write(tline)
    # close files
    ifp.close()
    ofp.close()

where

ifp = codecs.getreader('utf8')(sys.stdin)
ofp = codecs.getwriter('utf8')(sys.stdout)

so it takes one line from stdin, processes the data with some library trn.convert and writes the results to stdout without any parallelism.

For this reason (lack of parallelism in terms of multiline input), performance is bound by the size of the dataset (number of rows).

A two-column example input dataset (1K rows) is available here. A sample:

KN   ಐಕ್ಯತೆ ಕ್ಷೇಮಾಭಿವೃದ್ಧಿ ಸಂಸ್ಥೆ  ವಿಜಯಪುರ
KN   ಹೊರಗಿನ ಸಂಪರ್ಕಗಳು 
KN    ಮಕ್ಕಳ ಸಾಹಿತ್ಯ ಮತ್ತು ಸಾಂಸ್ಖ್ರುತಿಕ ಕ್ಷೇತ್ರದಲ್ಲಿ ಸೇವೆ ಸಲ್ಲಿಸುತ್ತಿರುವ ಸಂಸ್ಠೆ ಮಕ್ಕಳ ಲೋಕ  

while the example script based on the last accepted answer is here

loretoparisi
  • This might help: [How can I pass variables from awk to a shell command?](https://stackoverflow.com/q/20646819/3776858) – Cyrus Sep 20 '18 at 17:18
  • Is there a reason you want to use `awk` for this? There are serious security vulnerabilities in this code; it would be easier to write a secure implementation that didn't try to involve `awk` in the pipeline at all. – Charles Duffy Sep 20 '18 at 21:35
  • ...try passing, among your lines of input, `AS $(/tmp/foobar)'$(/tmp/foobar)'` -- if `/tmp/foobar` exists, it will be executed. – Charles Duffy Sep 20 '18 at 21:38
  • ...the shell itself treats data that comes from parameter expansion results differently than it treats code, but if something is passed into the shell *in the same string as the code*, such safety precautions are mooted. – Charles Duffy Sep 20 '18 at 21:40
  • BTW, can you stream multiple lines of input into `indictrans`? It might be much more efficient to keep a single long-running instance open; so long as it generates one line of output per line of input, that makes it easy to just generate a whole stream and then join it up with your first columns out-of-band. You'd get much better performance that way too. – Charles Duffy Sep 20 '18 at 21:52
  • @CharlesDuffy nope the binary takes `stdin` and write to `stdout` line by line. – loretoparisi Sep 20 '18 at 22:01
  • Ahh, well. I added a second branch to my answer which would work (far more efficiently) if-and-only-if the tool could be given 5 lines of input and would generate 5 lines of output with a 1:1 correlation; obviously, if that assumption doesn't hold, it's not much use. – Charles Duffy Sep 20 '18 at 22:02
  • That's a good point! I have checked the `indictrans` sources right now, check my update above. I think it's pretty simple how it is. – loretoparisi Sep 20 '18 at 22:07
  • Sure -- the space for improvements is speed more than simplicity; if you're translating hundreds of lines, the simpler approach runs hundreds of copies of `indictrans`, whereas the one with slightly hairier code just runs one (and shifts all hundred lines through it), so there's much less overhead involved in starting Python interpreters over and over. From just the code you've shown, I think the higher-performance version is likely to work fine, but if it doesn't, then a [mcve] (ideally with a link to `indictrans`, if it's public) would be needed. – Charles Duffy Sep 21 '18 at 01:57
  • @CharlesDuffy great. I have updated the script adding 1) the example dataset with 1K rows (my rows size go from ~100K to ~2M rows) 2) the script based on the last answer 3) the link to the tool source code. – loretoparisi Sep 21 '18 at 12:52

2 Answers

Don't invoke shells with awk. The shell itself avoids treating data as if it were code unless explicitly instructed to do otherwise -- but when you use system() or popen(), as the awk code is doing here, everything passed as an argument is parsed in a context where data is able to escape its quoting and be treated as code.
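The difference between data embedded in the code string and data passed as an argument can be seen in a small, harmless sketch. The payload below is a made-up example, not taken from the dataset, and `sh -c` stands in for what `system()` or `popen()` do when awk hands a string to the shell:

```shell
#!/usr/bin/env bash
# Made-up payload for illustration; it only echoes a word.
payload='$(echo INJECTED)'

# Unsafe: the payload is spliced into the code string, so the inner shell
# parses it as code and runs the command substitution.
unsafe=$(sh -c "printf '%s\n' $payload")

# Safe: the code string is constant and the payload arrives as argument $1,
# so the inner shell only ever treats it as data.
safe=$(sh -c 'printf "%s\n" "$1"' _ "$payload")

printf 'unsafe: %s\n' "$unsafe"
printf 'safe:   %s\n' "$safe"
```

The first call behaves like the awk one-liner in the question: whatever is in the input line becomes part of the program text.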


Simple approach: One indictrans per line

If you need a separate copy of indictrans for each line to be executed, use:

while read -r col1 rest; do
  printf '%s\t%s\n' "$col1" "$(indictrans -s asm -t eng --ml --build-lookup <<<"$rest")"
done <in.txt >out.txt
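Since the question also asks how to parametrize the source and target languages, here is a minimal sketch of the same loop wrapped in a function. The name `translate_per_line` is hypothetical, and `tr` stands in for `indictrans` in the demo so it can be run without the tool installed:

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of indictrans): runs the given command once
# per input line, feeding it everything after the first column on stdin.
translate_per_line() {
  local col1 rest
  while read -r col1 rest; do
    # "$@" is the command plus its arguments, kept as separate words,
    # so nothing read from the input is ever parsed as shell code.
    printf '%s\t%s\n' "$col1" "$(printf '%s\n' "$rest" | "$@")"
  done
}

# Intended use, with the flags from the question:
#   SOURCE=asm TARGET=eng
#   translate_per_line indictrans -s "$SOURCE" -t "$TARGET" --ml --build-lookup <in.txt >out.txt

# Runnable demo with tr standing in for indictrans:
printf 'AS abc\nKN def ghi\n' | translate_per_line tr a-z A-Z
```

Because the variables are expanded by the shell into separate arguments, no extra escaping is needed for `$SOURCE` and `$TARGET`.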

Fast approach: One indictrans processing all lines

If indictrans generates one line of output per line of input, you can do even better by pasting together one stream with all the first columns and a second stream with the translations of the remainder of the lines, thus requiring only one copy of indictrans to be run:

#!/usr/bin/env bash
#              ^^^^- not compatible with /bin/sh

paste <(<in.txt awk '{print $1}') \
      <(<in.txt sed -E 's/^[^[:space:]]*[[:space:]]//' \
                | indictrans -s asm -t eng --ml --build-lookup) \
  >out.txt
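
The approach above relies on the tool emitting exactly one output line per input line. A hedged sketch of how that assumption could be checked before pasting is shown here; the function name `paste_translated` and the use of temporary files are choices made for this example, and `tr` again stands in for `indictrans`:

```shell
#!/usr/bin/env bash
# Hypothetical helper: splits stdin into first column and remainder, runs the
# given command ONCE over all remainders, and pastes the two streams back
# together only if their line counts match.
paste_translated() {
  local tmpdir status
  tmpdir=$(mktemp -d) || return 1
  cat >"$tmpdir/in"
  awk '{print $1}' <"$tmpdir/in" >"$tmpdir/col1"
  sed -E 's/^[^[:space:]]*[[:space:]]//' <"$tmpdir/in" | "$@" >"$tmpdir/rest"
  if [ "$(wc -l <"$tmpdir/col1")" -eq "$(wc -l <"$tmpdir/rest")" ]; then
    paste "$tmpdir/col1" "$tmpdir/rest"
    status=0
  else
    echo "line counts differ: the 1:1 assumption does not hold" >&2
    status=1
  fi
  rm -rf "$tmpdir"
  return "$status"
}

# Runnable demo with tr standing in for indictrans:
printf 'AS\tabc\nKN\tdef ghi\n' | paste_translated tr a-z A-Z
```

With the real tool the call would look like `paste_translated indictrans -s asm -t eng --ml --build-lookup <in.txt >out.txt`, using the flags from the question.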
Charles Duffy
  • Thank you! I have also noticed that `awk` introduces further issues when executing commands: forking all those processes causes `too many file opened` errors (with both the `system` and `open/getline` approaches). – loretoparisi Sep 20 '18 at 21:51
  • I have updated this answer (that was previously accepted) because, as per the comments above, performance could be improved given more info about the dataset and a running example. See the updated question/comments for more info. – loretoparisi Sep 21 '18 at 12:54
  • This answer **already includes** the performance improvements -- that's the second half using `paste`. – Charles Duffy Sep 21 '18 at 13:58
  • pardon! right, by the way the optimized version does not keep the columns structure - see here https://gist.github.com/loretoparisi/2abb25c1db934bf7b77c6cc62cd857d7 what happens. Thanks! – loretoparisi Sep 21 '18 at 14:30
  • Your code changed `print $1` in my original to `print $2`, so that's the problem with the first column. The second column I can't help with, because I can't run `indictrans` -- it's a messy piece of software (incomplete `setup.py`, undeclared library dependencies, etc) -- but I'll give you another script you can use to prove that the pattern works when run with code that correctly translates input line-by-line: `paste <(<"$in" awk '{print $1}') <(<"$in" sed -E 's/^[^[:space:]]*[[:space:]]//' | while IFS= read -r line; do echo "$(base64 -w 0 <<<"$line")"; done)` – Charles Duffy Sep 21 '18 at 15:45

You can pipe column 2 to your command and replace it with the command's output in awk, like below.

{
    cmd = "echo "$2" | indictrans -s asm -t eng --ml --build-lookup"
    cmd | getline $2
    close(cmd)
} 1

If SOURCE and TARGET are awk variables:

{
    cmd = "echo "$0" | indictrans -s "SOURCE" -t "TARGET" --ml --build-lookup"
    cmd
    close(cmd)
}
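
As a hedged illustration of why `-v` variables have to be concatenated outside the string literal (inside the quotes they are just literal words), here is a runnable sketch. It assumes tab-separated input, uses `tr` as a stand-in for `indictrans`, and adds a hypothetical `shquote` helper so that field contents are single-quoted before they reach the shell:

```shell
#!/usr/bin/env bash
# Sketch only: tr stands in for indictrans, and the input is assumed to be
# tab-separated. from/to play the role of SOURCE/TARGET.
transliterate_col2() {
  awk -F '\t' -v OFS='\t' -v from="$1" -v to="$2" '
    # Hypothetical helper: wrap a value in single quotes for /bin/sh,
    # escaping any single quote it contains.
    function shquote(s) {
      gsub(/\047/, "\047\\\047\047", s)
      return "\047" s "\047"
    }
    {
      # Variables are joined to the string pieces by concatenation.
      cmd = "printf \"%s\\n\" " shquote($2) " | tr " shquote(from) " " shquote(to)
      res = ""
      cmd | getline res
      close(cmd)   # avoids running out of file descriptors on large inputs
      print $1, res
    }'
}

# Runnable demo:
printf 'AS\tabc\nAS\t$(echo pwned)\n' | transliterate_col2 a-z A-Z
```

This still starts one shell and one process per line, so it shares the performance cost of any per-line approach.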
oguz ismail