I have a problem processing text files in a data-processing pipeline written in Shell and Python. What is a better way to print text files to stdout so they can be fed through the pipeline (which uses Perl, in the script `tokenize.sh`, and Python)?
My current Shell script works fine, except that it does not output the last line of a .txt file. I'm also not sure whether I should use `cat` or `echo` or something else (instead of the `while IFS= read` loop) for better performance.
for f in path/to/dir/*.txt; do
    # read drops a final line that has no trailing newline
    while IFS= read -r line
    do
        echo "$line"
    done < "$f" \
        | tokenize.sh \
        | python clean.py \
        > "$f.clean.txt"
    rm "$f"
    mv "$f.clean.txt" "$f"
done
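As an aside on the missing last line: `read` returns a non-zero status when it hits EOF on a final line that has no trailing newline, so the loop body is skipped for that line even though `line` was filled. A common workaround (a sketch, keeping the rest of the loop unchanged) is:

while IFS= read -r line || [ -n "$line" ]
do
    # the extra test runs the body once more when read fails
    # at EOF but still captured a partial (newline-less) line
    echo "$line"
done < "$f"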
I tried using `awk` as below, and it seems to work well.
for f in path/to/dir/*.txt; do
    # awk prints every line, including a final line without a trailing newline
    awk '{ print }' "$f" \
        | tokenize.sh \
        | python clean.py \
        > "$f.clean.txt"
    rm "$f"
    mv "$f.clean.txt" "$f"
done
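For comparison, this is what a `cat`-based version of the same loop would look like (a minimal sketch: `cat` streams each file byte-for-byte, so a final line without a trailing newline still reaches the pipeline, and no per-line Shell loop is involved):

for f in path/to/dir/*.txt; do
    # cat streams the whole file to the pipeline in one pass
    cat "$f" \
        | tokenize.sh \
        | python clean.py \
        > "$f.clean.txt"
    mv "$f.clean.txt" "$f"
done

Since `mv` overwrites its destination, the separate `rm` isn't needed in this version.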