I have a problem when processing text files in a data processing pipeline in Shell and Python.

What is a better way to send text files to stdout so they can go through a data processing pipeline (a Perl-based script, tokenize.sh, followed by Python)?

My current shell script works fine except that it does not output the last line of a txt file. I'm not sure whether I should use cat, echo, or something else (instead of while IFS= read line ...) for better performance.

for f in path/to/dir/*.txt; do
  while IFS= read line
  do
    echo $line 
  done < "$f" \
  | tokenize.sh \
  | python clean.py \
  >> $f.clean.txt 
  rm $f 
  mv $f.clean.txt $f 
done

I tried using awk as below and it seems to work well.

for f in path/to/dir/*.txt; do
  awk '{ print }' $f \
  | tokenize.sh \
  | python clean.py \
  >> $f.clean.txt 
  rm $f 
  mv $f.clean.txt $f 
done
tripleee
Sophil
  • This is arguably too broad. A single question per question, please. – tripleee Aug 04 '19 at 18:09
  • The `while read` loop serves no useful purpose, contains a quoting bug, and will massively slow you down. You want simply `for f in path/to/dir/*.txt; do tokenize.sh <"$f" | python`... – tripleee Aug 04 '19 at 18:11
  • @tripleee Thank you so much for your comment! I will split off the second question and ask it in another thread. I tried your suggestion, but the script does not stop running after the `python clean.py` step. I tried using `awk` and it seems to work well. I just don't feel confident about that solution because I'm very new to shell scripting. – Sophil Aug 04 '19 at 18:22
  • @tripleee Oh, I'm so sorry, I tried your solution but I missed the `<"$f"` part :(. – Sophil Aug 04 '19 at 18:26
  • Note: I think it is generally a bad idea to delete the input file. Instead, you could compress it or move it to another directory. – wildplasser Aug 04 '19 at 18:32
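One way to follow that last suggestion is to keep a backup and only replace the original once processing has succeeded. This is just a rough sketch, not part of anyone's posted solution: `process` is a stand-in for `tokenize.sh | python clean.py`, and `rewrite_with_backup` and the `.bak` suffix are names made up for illustration.

```shell
#!/bin/sh
# Stand-in for "tokenize.sh | python clean.py" (assumption: here it
# simply upper-cases the text so the demo is self-contained).
process() { tr '[:lower:]' '[:upper:]'; }

# Rewrite one file in place, keeping the original as "$1.bak".
# The original is only replaced if the processing step succeeded.
rewrite_with_backup() {
  tmp=$(mktemp) || return 1
  if process < "$1" > "$tmp"; then
    cp -p "$1" "$1.bak" && mv "$tmp" "$1"
  else
    rm -f "$tmp"
    return 1
  fi
}

# Demo on a scratch file in a temporary directory.
demo_dir=$(mktemp -d)
printf 'some text\n' > "$demo_dir/a.txt"
rewrite_with_backup "$demo_dir/a.txt"
new_content=$(cat "$demo_dir/a.txt")
backup_content=$(cat "$demo_dir/a.txt.bak")
echo "new: $new_content / backup: $backup_content"
rm -rf "$demo_dir"
```

With a real two-stage pipeline, only the exit status of the last command is checked unless something like bash's `set -o pipefail` is enabled, so a failure in the first stage could still go unnoticed. Also note that the replaced file ends up with the restrictive permissions that `mktemp` gives its files.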

1 Answer

Try this:

for f in path/to/dir/*.txt; do

  # - while loop replaced by "<" redirection
  # - "$f" quoted to handle special characters. <<< IMPORTANT!
  # - is ">>" really necessary? It appends, which has a side effect
  #   if "$f.clean.txt" already exists

  tokenize.sh < "$f" | python clean.py > "$f.clean.txt"

  # "mv" overwrites "$f", so the separate "rm" is not needed,
  # and "$f" exists at all times
  # rm $f
  mv "$f.clean.txt" "$f"

done
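
For context on why the original loop lost the last line: `read` returns a non-zero status when it reaches end-of-file before seeing a newline, so a final line without a trailing newline is read into the variable but the loop body never runs for it. The unquoted `echo $line` additionally collapses runs of whitespace. A minimal sketch of both effects follows; the scratch file and the function names `naive_copy` and `careful_copy` are made up for illustration.

```shell
#!/bin/sh
# Hypothetical demo file: two lines, the last one WITHOUT a trailing newline.
demo=$(mktemp)
printf 'first  line\nlast line' > "$demo"

# The loop from the question: it drops the unterminated last line, and
# the unquoted $line collapses the double space.
naive_copy() {
  while IFS= read line; do
    echo $line
  done < "$1"
}

# The same loop with the usual fixes: -r keeps backslashes literal, the
# extra test handles a last line with no newline, and the quotes
# preserve whitespace.
careful_copy() {
  while IFS= read -r line || [ -n "$line" ]; do
    printf '%s\n' "$line"
  done < "$1"
}

naive_out=$(naive_copy "$demo")
careful_out=$(careful_copy "$demo")
printf 'naive:\n%s\n\ncareful:\n%s\n' "$naive_out" "$careful_out"
rm -f "$demo"
```

Redirecting the file straight into `tokenize.sh`, as above, sidesteps all of this because no line-by-line shell loop is involved.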
Wiimm
  • Maybe still highlight the significance of the quoting fixes. See [When to wrap quotes around a shell variable?](https://stackoverflow.com/questions/10067266/when-to-wrap-quotes-around-a-shell-variable) – tripleee Aug 04 '19 at 18:46
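
To illustrate what the quoting changes, here is a small sketch. The file name `my file.txt` and the helper `count_args` are invented for the demo; no file is actually created.

```shell
#!/bin/sh
# Hypothetical file name containing a space.
f='my file.txt'

# Helper that reports how many arguments it received.
count_args() { echo "$#"; }

unquoted=$(count_args $f)    # word splitting: the name becomes two arguments
quoted=$(count_args "$f")    # quoted: the name stays one argument

echo "unquoted: $unquoted argument(s), quoted: $quoted argument(s)"
# prints: unquoted: 2 argument(s), quoted: 1 argument(s)
```

In the original script, `rm $f` or `mv $f.clean.txt $f` with such a name would therefore act on the wrong paths, which is why the quotes matter.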