Using Parallel with a sed command iterating over thousands of files

Question

I have hundreds of thousands of files that I want to run the sed command below on:

sed -s -i -e 's/[[:space:]].*//' -e '1 s/^/>/g' -e '3 s/|*//g' -e '3 s/^/>ref/g' -e '1h;2H;1,2d;4G'

So far, I have been using a bash loop:

for i in read_* ; do
    sed -s -i -e 's/[[:space:]].*//' -e '1 s/^/>/g' -e '3 s/|*//g' -e '3 s/^/>ref/g' -e '1h;2H;1,2d;4G' "$i"
    mv "$i" "$i.fasta"
done

How can I use GNU Parallel to speed this up?

ls read_* > list.read.txt
parallel -j $cores -a list.read.txt sed -s -i -e 's/[[:space:]].*//' -e '1 s/^/>/g' -e '3 s/|*//g' -e '3 s/^/>ref/g' -e '1h;2H;1,2d;4G' []

I tried the above method, where I create a list of files to iterate over and run 10 jobs at once; however, I get sed-related errors.

Interesting problem, but you forgot to include the most important bit of information: "I get sed related error". What are they? Please add those to your question. Good luck. — shellter, Feb 03 '23 at 18:43

score 3 · Answer 1 · answered Feb 03 '23 at 17:38

Try

parallel -q -v -j "$cores" -a list.read.txt sed -s -i -e 's/[[:space:]].*//' -e '1 s/^/>/g' -e '3 s/|*//g' -e '3 s/^/>ref/g' -e '1h;2H;1,2d;4G'

The -q option is necessary to quote special characters (spaces, >, ...) in the command arguments.
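The [] was causing the code to break when I tested it, so I removed it. I don't know what it was supposed to do (possibly {} was intended, which is GNU Parallel's default replacement string). When the command contains no replacement string, each input line is appended to the end of the command, so the file name still reaches sed.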
I added quotes to "$cores" because variable expansions should almost always be quoted. See "When to wrap quotes around a shell variable?". Use ShellCheck to find missing quotes and many other shell code errors.
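To check the command on something small before pointing it at the real data, a sketch along these lines might help. The temporary directory, the two sample files and their contents, and cores=2 are all made up for illustration. The rename in a second pass and the plain-loop fallback (used when GNU Parallel is not installed) are my additions, not part of the command above.

```shell
#!/bin/sh
# Sketch only: sample files, their contents and cores=2 are hypothetical.
set -eu

workdir=$(mktemp -d)
cd "$workdir"
cores=2

# Two small four-line sample files.
for n in 1 2; do
    printf 'read%s extra\nACGT\nchr|%s| info\nTTGA\n' "$n" "$n" > "read_$n"
done

# printf is a shell builtin, so a very large glob does not hit the
# "Argument list too long" limit that `ls read_*` can run into.
printf '%s\n' read_* > list.read.txt

# Detect GNU Parallel (other tools are also installed under the name "parallel").
pv=$(parallel --version 2>/dev/null || true)

case "$pv" in
    *"GNU parallel"*)
        # -q protects the sed expressions; each file name is appended to the command.
        parallel -q -j "$cores" -a list.read.txt sed -s -i -e 's/[[:space:]].*//' -e '1 s/^/>/g' -e '3 s/|*//g' -e '3 s/^/>ref/g' -e '1h;2H;1,2d;4G'
        # Second pass: rename each file, as the original loop did.
        parallel -j "$cores" -a list.read.txt mv {} {}.fasta
        ;;
    *)
        # Fallback: the same work done sequentially.
        while IFS= read -r f; do
            sed -s -i -e 's/[[:space:]].*//' -e '1 s/^/>/g' -e '3 s/|*//g' -e '3 s/^/>ref/g' -e '1h;2H;1,2d;4G' "$f"
            mv "$f" "$f.fasta"
        done < list.read.txt
        ;;
esac

# Show one result for inspection.
cat read_1.fasta
```

With GNU sed, the sample read_1 should come out as the four lines >refchr1, TTGA, >read1, ACGT: the sed script moves lines 1 and 2 after line 4 and trims everything from the first whitespace on each line.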
