1

Editor's note: This question was always about loop performance, but the original title led some answerers - and voters - to believe it was about how to remove Windows line endings.

The bash loop below removes the Windows line endings from each file and converts them to Unix ones. It appears to run, but it is slow. The input files are small (4 files ranging from 167 bytes to 1 KB), all with the same structure (a list of names); the only thing that varies is the length (i.e., some files have 10 names, others 50). Is it supposed to take over 15 minutes to complete this task on a Xeon processor? Thank you :)

for f in /home/cmccabe/Desktop/files/*.txt ; do
 bname=`basename $f`
 pref=${bname%%.txt}
sed 's/\r//' $f - $f > /home/cmccabe/Desktop/files/${pref}_unix.txt
done

Input .txt files

AP3B1
BRCA2
BRIP1
CBL
CTC1

EDIT

This is not a duplicate: I was asking why my bash loop that uses sed to remove Windows line endings was running so slowly, not how to remove them. I was asking for ideas that might speed up the loop, and I got many. Thank you :). I hope this helps.

midori
justaguy
  • Possible duplicate of [Remove carriage return in Unix](http://stackoverflow.com/questions/800030/remove-carriage-return-in-unix) – Mr. Llama Oct 07 '15 at 19:20
    That's like asking why the water in your glass is wet. A shell loop calling sed **IS** incredibly slow. – Ed Morton Oct 07 '15 at 20:58

5 Answers

6

Use the utilities dos2unix and unix2dos to convert between Unix- and Windows-style line endings.
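A minimal sketch of how this looks in practice; `dos2unix` converts in place by default, and `tr -d '\r'` is shown as a portable fallback in case `dos2unix` is not installed (the sample filename here is made up for illustration):

```shell
# Create a sample file with Windows (CRLF) line endings.
printf 'AP3B1\r\nBRCA2\r\n' > sample.txt

# With dos2unix installed, each direction is a single command:
#   dos2unix sample.txt      # CRLF -> LF, in place
#   unix2dos sample.txt      # LF -> CRLF, in place

# Portable fallback: delete every carriage-return byte with tr.
tr -d '\r' < sample.txt > sample_unix.txt
```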

John1024
5

Your `sed` command looks wrong. I believe the trailing `$f - $f` should simply be `$f`. Running your script as written hangs for a very long time on my system (the `-` makes sed wait on standard input), but making this change causes it to complete almost instantly.

Of course, the best answer is to use dos2unix, which was designed to handle this exact thing:

cd /home/cmccabe/Desktop/files
for f in *.txt ; do
    pref=$(basename -s '.txt' "$f")
    dos2unix -q -n "$f" "${pref}_unix.txt"
done
bta
4

This always works for me:

perl -pe 's/\r\n/\n/' inputfile.txt > outputfile.txt
JochenDB
1

You can use dos2unix as stated before, or use this small sed command:

sed 's/\r//' file
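Since the question is about loop performance: GNU sed can also edit all the files in place with one invocation, avoiding a per-file process (a sketch; `-i` is a GNU extension, so BSD/macOS sed needs `-i ''` instead, and the filenames are made up for illustration):

```shell
# Create sample files with CRLF line endings.
printf 'CTC1\r\n' > a.txt
printf 'CBL\r\n' > b.txt

# One sed process strips the trailing carriage return from
# every line of every file, in place.
sed -i 's/\r$//' a.txt b.txt
```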
midori
  • True (assuming you use _GNU_ `sed`), but that command was part of the OP's question to begin with; the OP's question was not about _how_ to do it, but how to do it _faster_ in their specific scenario. – mklement0 Oct 19 '15 at 21:05
  • the question was totally different when i was answering it @mklement0 – midori Oct 20 '15 at 22:09
  • The original _title_ lacked focus, which is why I changed it and added a note at the top; the _gist_ of the _body_ never changed, however ("why is this so slow?") and it contained `sed 's/\r//'` from the very beginning. – mklement0 Oct 21 '15 at 02:57
1

The key to performance in Bash is to avoid loops in general, and in particular those that call one or more external utilities in each iteration.

Here is a solution that uses a single GNU awk command:

awk -v RS='\r\n' '
  BEGINFILE { outFile=gensub("\\.txt$", "_unix&", 1, FILENAME) }
  { print > outFile }
' /home/cmccabe/Desktop/files/*.txt
  • -v RS='\r\n' sets CRLF as the input record separator; by leaving ORS, the output record separator, at its default of \n, simply printing each input line terminates it with \n.
  • the BEGINFILE block is executed every time processing of a new input file starts; in it, gensub() is used to insert _unix before the .txt suffix of the input file at hand to form the output filename.
  • {print > outFile} simply prints the \n-terminated lines to the output file at hand.

Note that use of a multi-character RS value, the BEGINFILE block, and the gensub() function are GNU extensions to the POSIX standard.
Switching from the OP's sed solution to a GNU awk-based one was necessary in order to provide a single-command solution that is both simpler and faster.


Alternatively, here's a solution that relies on dos2unix for conversion of Windows line endings (for instance, you can install dos2unix with sudo apt-get install dos2unix on Debian-based systems); except for requiring dos2unix, it should work on most platforms (no GNU utilities required):

  • It uses a loop only to construct the array of filename arguments to pass to dos2unix - this should be fast, given that no call to basename is involved; Bash-native parameter expansion is used instead.
  • then uses a single invocation of dos2unix to process all files.
# cd to the target folder, so that the operations below do not need to handle
# path components.
cd '/home/cmccabe/Desktop/files'

# Collect all *.txt filenames in an array.
inFiles=( *.txt )

# Derive output filenames from it, using Bash parameter expansion:
# '%.txt' matches '.txt' at the end of each array element, and replaces it
# with '_unix.txt', effectively inserting '_unix' before the suffix.
outFiles=( "${inFiles[@]/%.txt/_unix.txt}" )

# Create an interleaved array of *input-output filename pairs* to be passed
# to dos2unix later.
# To inspect the resulting array, run `printf '%s\n' "${fileArgs[@]}"`
# You'll see pairs like these:
#    file1.txt
#    file1_unix.txt
#    ...
fileArgs=(); i=0
for inFile in "${inFiles[@]}"; do
  fileArgs+=( "$inFile" "${outFiles[i++]}" )
done

# Now, use a *single* invocation of dos2unix, passing all input-output
# filename pairs at once.
dos2unix -q -n "${fileArgs[@]}"
mklement0