1

I want to create a utility function for bash to remove duplicate lines. I am using function

function remove_empty_lines() {
  if ! command -v awk &> /dev/null
  then
    echo '[x] ERR: "awk" command not found'
    return
  fi

  if [[ -z "$1" ]]
  then
    echo "usage: remove_empty_lines <file-name> [--replace]"
    echo
    echo "Arguments:"
    echo -e "\t--replace\t (Optional) If not passed, the result will be redirected to stdout"
    return
  fi

  if [[ ! -f "$1" ]]
  then
    echo "[x] ERR: \"$1\" file not found"
    return
  fi
  echo $0
  local CMD="awk '!seen[$0]++' $1"

  if [[ "$2" = '--reload' ]]
  then
    CMD+=" > $1"
  fi

  echo $CMD
}

If I am running the main awk command directly, it is working. But when i execute the same $CMD in the function, I am getting this error

$ remove_empty_lines app.js
/bin/bash
awk '!x[/bin/bash]++' app.js
tbhaxor
  • 1,659
  • 2
  • 13
  • 43
  • can you show how you are executing `$CMD` in the function? – Christian Fritz Oct 16 '20 at 15:27
  • 1
    There's no execution in the code shown here at all. – Charles Duffy Oct 16 '20 at 15:41
  • FYI, re: `function funcname() {`, see https://wiki.bash-hackers.org/scripting/obsolete – Charles Duffy Oct 16 '20 at 15:43
  • If you build a string that you want the shell to evaluate, you should use `eval`. Given that you are appending what appears to be file redirections into your command, that appears to be what you want. But see many other sources for why that is not what you want. (Eg, google "eval is evil") – William Pursell Oct 16 '20 at 15:50
  • @WilliamPursell, I disagree with saying that someone "should use eval" without there being compelling reasons why they can't accomplish their stated goal without it. In this case the clear reason they're going that route is to be able to perform an optional redirection, but there are safer ways to do that. – Charles Duffy Oct 16 '20 at 15:53
  • @CharlesDuffy I don't disagree with you, and perhaps my phrasing was poor. I am strongly suggesting that the overall approach should be changed to avoid constructing a string to be evaluated. – William Pursell Oct 16 '20 at 16:18
  • @ChristianFritz I am executing function by `$(${CMD})` – tbhaxor Oct 16 '20 at 16:20
  • @tbhaxor, read BashFAQ #50 (as previously linked) to understand why that's unreliable and what the alternatives are. (One easy thing you can try is `CMD='echo hello >out'; $(${CMD})`; you'll see it writing `hello >out` to stdout instead of writing `hello` to a file named `out`). – Charles Duffy Oct 16 '20 at 16:38

2 Answers2

6

The original code is broken in several ways:

  • When used with --reload, it would truncate the output file's contents before awk could ever read those contents (see How can I use a file in a command and redirect output to the same file without truncating it?)
  • It didn't ever actually run the command, and for the reasons described in BashFAQ #50, storing a shell command in a string is inherently buggy (one can work around some of those issues with eval; BashFAQ #48 describes why doing so introduces security bugs).
  • It wrote error messages (and other "diagnostic content") to stdout instead of stderr; this means that if your function's output was redirected to a file, you could never see its errors -- they'd end up jumbled into the output.
  • Error cases were handled with a return even in cases where $? would be zero; this means that return itself would return a zero/successful/truthy status, not revealing to the caller that any error had taken place.

Presumably the reason you were storing your output in CMD was to be able to perform a redirection conditionally, but that can be done other ways: Below, we always create a file descriptor out_fd, but point it to either stdout (when called without --reload), or to a temporary file (if called with --reload); if-and-only-if awk succeeds, we then move the temporary file over the output file, thus replacing it as an atomic operation.

remove_empty_lines() {
  local out_fd rc=0 tempfile=
  command -v awk &>/dev/null || { echo '[x] ERR: "awk" command not found' >&2; return 1; }

  if [[ -z "$1" ]]; then
    printf '%b\n' >&2 \
      'usage: remove_empty_lines <file-name> [--replace]' \
      '' \
      'Arguments:' \
      '\t--replace\t(Optional) If not passed, the result will be redirected to stdout'
    return 1
  fi

  [[ -f "$1" ]] || { echo "[x] ERR: \"$1\" file not found" >&2; return 1; }

  if [ "$2" = --reload ]; then
    tempfile=$(mktemp -t "$1.XXXXXX") || return
    exec {out_fd}>"$tempfile" || { rc=$?; rm -f "$tempfile"; return "$rc"; }
  else
    exec {out_fd}>&1
  fi

  awk '!seen[$0]++' <"$1" >&$out_fd || { rc=$?; rm -f "$tempfile"; return "$rc"; }
  exec {out_fd}>&- # close our file descriptor

  if [[ $tempfile ]]; then
    mv -- "$tempfile" "$1" || return
  fi
}
Charles Duffy
  • 280,126
  • 43
  • 390
  • 441
  • Nice answer. Note that the awk script in question has to be able to read all unique $0 values into memory so if your input file has a massive number of unique lines that'll fail but if you have to handle that case then you could do this instead: `cat -n file | sort -k2 -k1n | uniq -f1 | sort -k1n | cut -f2-` The only command in that pipeline that has to process the whole file "at once" is `sort` and it's designed to use paging, etc. to handle massive files so it should do better than `awk` in that regard. If your `cat` doesn't have a `-n` option then use `awk '{print NR "\t" $0}'`. – Ed Morton Oct 16 '20 at 18:09
  • Thanks, It actually worked. Also thanks for the KB you have provided – tbhaxor Oct 16 '20 at 20:40
1

First off the output from your function call is not an error but rather the output of two echo commands (echo $0 and echo $CMD).

And as Charles Duffy has pointed out ... at no point is the function actually running the $CMD.


As for the inclusion of /bin/bash in your function's echo output ... the main problem is the reference to $0; by definition $0 is the name of the running process, which in the case of a function is the shell under which the function is being called. Consider the following when run from a bash command prompt:

$ echo $0
-bash

As you can see from your output this generates /bin/bash in your environment. See this and this for more details.

On a related note, the reference to $0 within double quotes causes the $0 to be evaluated, so this:

local CMD="awk '!seen[$0]++' $1"

becomes

local CMD="awk '!seen[/bin/bash]++' app.js"

I'm thinking what you want is something like:

echo $1                               # the name of the file to be processed
local CMD="awk '!seen[\$0]++' $1"     # escape the '$' in '$0'

becomes

local CMD="awk '!seen[$0]++' app.js"

That should fix the issues shown in your function's output; as for the other issues ... you're getting a good bit of feedback in the various comments ...

markp-fuso
  • 28,790
  • 4
  • 16
  • 36
  • Certainly not the only problem; they're also running afoul of [BashFAQ #50](http://mywiki.wooledge.org/BashFAQ/050). Assigning a string with the text of a command to a variable should not be done at all. – Charles Duffy Oct 16 '20 at 15:42
  • Also, their `>app.js` will empty out the output file before `awk` can ever read its original contents if they run with `--reload`. – Charles Duffy Oct 16 '20 at 15:48
  • @CharlesDuffy; at this point I'm just addressing the (non error) output of the OPs question and explaining the `$0 = /bin/bash`; you're welcome to write up an answer with all of the technical issues should the OP figure out how to 'run' `$CMD` :-) – markp-fuso Oct 16 '20 at 15:51