
I wrote a bash function that converts filenames to title case, then restores the capitalization of acronyms, camera prefixes like _DSC/_IMG, and so on. It also lowercases conjunctions.

The problem is that it is very slow. As a test, I ran it on 1000 files and it took 4m40s to finish, and I have over 1 million files to rename.

The function reads a list of sed replacements from a file and runs each of them in turn to fix the capitalization after the initial conversion.

Here is the function (I only have BSD/macOS bash available, so ${a,,} and sed 's/./\L&/g' do not work for me):

IFS=$'\r\n' GLOBIGNORE='*' command eval  'capsarray=($(cat capslist.txt))'
tc () {
capped="$(echo "$1" | tr '[:upper:]' '[:lower:]' | sed -e "s^_^_ ^g" -e "s^/^/ ^g" | awk '{for(i=1;i<=NF;i++){ $i=toupper(substr($i,1,1)) substr($i,2) }}1')"
for d in "${capsarray[@]}"; do
    capped="$(echo "$capped" | sed "s^$d^g")"
done
echo "$capped" | sed -e "s^_ ^_^g" -e "s^/ ^/^g"
}
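One quick win (a sketch, not the code above): compile capslist.txt into a sed script once, so each name pays for a single sed process covering all rules instead of one sed process per rule. The tc_fast function and the caps.sed filename are hypothetical names for illustration.

```shell
# Sketch: compile capslist.txt into a sed script once, then apply every
# rule with one sed process per call. Each "From^to" rule line becomes
# the sed command "s^From^to^g". tc_fast and caps.sed are made-up names.
tc_fast () {
    # build caps.sed on first use (skip blank lines)
    [ -s caps.sed ] || sed -e '/^$/d' -e 's/.*/s^&^g/' capslist.txt > caps.sed
    printf '%s\n' "$1" \
      | tr '[:upper:]' '[:lower:]' \
      | sed -e 's^_^_ ^g' -e 's^/^/ ^g' \
      | awk '{for(i=1;i<=NF;i++){ $i=toupper(substr($i,1,1)) substr($i,2) }}1' \
      | sed -f caps.sed \
      | sed -e 's^_ ^_^g' -e 's^/ ^/^g'
}
```

With N rules this replaces N sed processes per filename with one, while keeping the exact same rule file.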

and here is part of capslist.txt:

A ^a 
W ^w 
With ^with 
The ^the 
Is ^is 
Of ^of 
And ^and 
Or ^or 
But ^but 
To ^to 
In ^in 
By ^by 
Dsc^DSC
Img^IMG
_Mg^_MG
\(_[0-9][0-9]\)c^\1C
\(_[0-9][0-9]\)s^\1S
\(_[0-9][0-9]\)a^\1A
_Dji^_DJI
Dji_^DJI_
Img_^IMG_
_Kis\([0-9][0-9]\)^_KIS\1
Dk ^DK 
Uk ^UK 
Eu ^EU 

etc...

There are a lot of entries in this list, and I need to add even more for it to be complete.

Attempted solution to speed it up:

So I split the input file list into 16 arrays and run the function on each portion simultaneously. This makes it a lot faster, since it can now run on multiple cores.

Now I'm down to 1m30s on my 4-core (8 thread) machine.

But it's still not fast enough to do 1 million files.

Here is the glorious multicore bash beauty:

####### 16 THREADS:
declare -a array1
declare -a array2
declare -a array3
declare -a array4
declare -a array5
declare -a array6
declare -a array7
declare -a array8
declare -a array9
declare -a array10
declare -a array11
declare -a array12
declare -a array13
declare -a array14
declare -a array15
declare -a array16
total=${#inputfiles[@]}
div1="$(expr $total / 16)"
div2="$(expr $div1 + $div1)"
div3="$(expr $div1 + $div1 + $div1)"
div4="$(expr $div1 + $div1 + $div1 + $div1)"
div5="$(expr $div1 + $div1 + $div1 + $div1 + $div1)"
div6="$(expr $div1 + $div1 + $div1 + $div1 + $div1 + $div1)"
div7="$(expr $div1 + $div1 + $div1 + $div1 + $div1 + $div1 + $div1)"
div8="$(expr $div1 + $div1 + $div1 + $div1 + $div1 + $div1 + $div1 + $div1)"
div9="$(expr $div1 + $div1 + $div1 + $div1 + $div1 + $div1 + $div1 + $div1 + $div1)"
div10="$(expr $div1 + $div1 + $div1 + $div1 + $div1 + $div1 + $div1 + $div1 + $div1 + $div1)"
div11="$(expr $div1 + $div1 + $div1 + $div1 + $div1 + $div1 + $div1 + $div1 + $div1 + $div1 + $div1)"
div12="$(expr $div1 + $div1 + $div1 + $div1 + $div1 + $div1 + $div1 + $div1 + $div1 + $div1 + $div1 + $div1)"
div13="$(expr $div1 + $div1 + $div1 + $div1 + $div1 + $div1 + $div1 + $div1 + $div1 + $div1 + $div1 + $div1 + $div1)"
div14="$(expr $div1 + $div1 + $div1 + $div1 + $div1 + $div1 + $div1 + $div1 + $div1 + $div1 + $div1 + $div1 + $div1 + $div1)"
div15="$(expr $div1 + $div1 + $div1 + $div1 + $div1 + $div1 + $div1 + $div1 + $div1 + $div1 + $div1 + $div1 + $div1 + $div1 + $div1)"
array1=("${inputfiles[@]:0:$div1}")
array2=("${inputfiles[@]:$div1:$div1}")
array3=("${inputfiles[@]:$div2:$div1}")
array4=("${inputfiles[@]:$div3:$div1}")
array5=("${inputfiles[@]:$div4:$div1}")
array6=("${inputfiles[@]:$div5:$div1}")
array7=("${inputfiles[@]:$div6:$div1}")
array8=("${inputfiles[@]:$div7:$div1}")
array9=("${inputfiles[@]:$div8:$div1}")
array10=("${inputfiles[@]:$div9:$div1}")
array11=("${inputfiles[@]:$div10:$div1}")
array12=("${inputfiles[@]:$div11:$div1}")
array13=("${inputfiles[@]:$div12:$div1}")
array14=("${inputfiles[@]:$div13:$div1}")
array15=("${inputfiles[@]:$div14:$div1}")
array16=("${inputfiles[@]:$div15}")
function tcmv {
    b="$(basename "$d")"
    pathtofile="$(dirname "$d")"
    mv "$d" "$pathtofile"/"$(tc "$b")"
}
for d in "${array1[@]}"; do tcmv; done &
for d in "${array2[@]}"; do tcmv; done &
for d in "${array3[@]}"; do tcmv; done &
for d in "${array4[@]}"; do tcmv; done &
for d in "${array5[@]}"; do tcmv; done &
for d in "${array6[@]}"; do tcmv; done &
for d in "${array7[@]}"; do tcmv; done &
for d in "${array8[@]}"; do tcmv; done &
for d in "${array9[@]}"; do tcmv; done &
for d in "${array10[@]}"; do tcmv; done &
for d in "${array11[@]}"; do tcmv; done &
for d in "${array12[@]}"; do tcmv; done &
for d in "${array13[@]}"; do tcmv; done &
for d in "${array14[@]}"; do tcmv; done &
for d in "${array15[@]}"; do tcmv; done &
for d in "${array16[@]}"; do tcmv; done &
wait
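For comparison, the same fan-out can be done without hand-splitting into 16 arrays: BSD xargs supports -P for parallel workers. This is a generic sketch with throwaway demo files and a trivial placeholder "worker" (append .done), not the real tc + mv logic.

```shell
# Sketch: xargs -P fans the file list out over 4 parallel workers.
# demo/ and the ".done" rename are placeholders for the real logic.
mkdir -p demo && touch demo/A demo/B demo/C
find demo -type f -print0 \
  | xargs -0 -n 1 -P 4 sh -c 'mv "$0" "$0.done"'
```

The -print0 / -0 pair keeps filenames with spaces intact, and -P scales to any worker count without duplicating loops.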

EDIT: Example conversions with the custom case called out:

20200908_LA_HOLLYWOOD AND HIGHLAND_BUSY STREET_TIME LAPSE_DSC0795.NEF
20200908_LA_Hollywood and Highland_Busy Street_Time Lapse_DSC0795.NEF
         ^^           ^^^                                 ^^^     ^^^

20180706_STADIUM_SFO VS NYC_BASKETBALL MATCH_04B8897_OK TO PUBLISH.JPG
20180706_Stadium_SFO vs NYC_Basketball Match_04B8897_OK to Publish.JPG
                 ^^^ ^^ ^^^                   ^      ^^ ^^         ^^^

So my question is:

Is there a way to make this faster? And how?

EDIT2:

Based on the suggestions in the comments, I was able to come up with something a lot better:

IFS=$'\r\n' GLOBIGNORE='*' command eval  'capsarray=($(cat capslist.txt))'
tr '[:upper:]' '[:lower:]' < tmplist | sed -e "s^_^_ ^g" -e "s^/^/ ^g" | awk '{for(i=1;i<=NF;i++){ $i=toupper(substr($i,1,1)) substr($i,2) }}1' > tmplistB
tr '[:upper:]' '[:lower:]' < tmplist2 | sed -e "s^_^_ ^g" -e "s^/^/ ^g" | awk '{for(i=1;i<=NF;i++){ $i=toupper(substr($i,1,1)) substr($i,2) }}1' > tmplist2B
sedargs=()
for d in "${capsarray[@]}"; do
    sedargs+=(-e "s^$d^g")    # keep each rule (spaces included) as a single argument
done
sed "${sedargs[@]}" tmplistB > tmplistC
sed "${sedargs[@]}" tmplist2B > tmplist2C
sed -e "s^_ ^_^g" -e "s^/ ^/^g" tmplistC > tmplistD
sed -e "s^_ ^_^g" -e "s^/ ^/^g" tmplist2C > tmplist2D
IFS=$'\r\n' GLOBIGNORE='*' command eval  'renamedarray=($(cat tmplistD))'
IFS=$'\r\n' GLOBIGNORE='*' command eval  'renameddirarray=($(cat tmplist2D))'

i=0    # myarray holds the original paths, in the same order as renamedarray
for d in "${renamedarray[@]}"; do
    b="$(basename "$d")"
    pathtofile="$(dirname "${myarray[$i]}")"
    mv "${myarray[$i]}" "$pathtofile/$b"
    i=$((i+1))
done

This now does all the hard work on text files instead of the real files, and it uses a single sed with chained -e expressions (thanks @Mark-Setchell and @thanasisp).

Way faster. Now it's down to 15 seconds for 1000 files, without having to split the list across 16 threads. At this rate it would only take a little over 4 hours to rename a million files!

(it also renames files first and folders second, to avoid renaming a folder before the files inside it have been renamed)
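An alternative way to get that ordering (a generic sketch, not the script above): sort all paths by depth, deepest first, so no directory is ever renamed before its contents.

```shell
# Sketch: deepest paths first. awk prefixes each path with its component
# count; sort -rn puts deeper paths first; cut strips the count again.
printf '%s\n' 'a/b/c/file' 'a/b' 'a' \
  | awk -F/ '{ print NF, $0 }' \
  | sort -rn \
  | cut -d' ' -f2-
```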

So, as far as I'm concerned, this is good enough for what I need to do, but I didn't think I should answer my own question with the above, as I feel it could be even more efficient. For example, with the perl and awk solutions suggested in the comments (I'm still trying to understand those... keep googling).
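To make the awk idea from the comments concrete, here is a hedged sketch of the whole conversion as a single awk process: lowercase everything, title-case at word boundaries, then one gsub per exception rule. The four gsub rules shown are a tiny illustrative subset standing in for capslist.txt, and tc_awk is a hypothetical name.

```shell
# Sketch: the entire conversion in one tr + one awk process per list.
# The gsub exception rules below are a small sample; the real
# capslist.txt would become one gsub line each.
tc_awk () {
    tr '[:upper:]' '[:lower:]' | awk '
    {
        out = ""
        for (i = 1; i <= length($0); i++) {
            c = substr($0, i, 1)
            prev = (i == 1) ? "" : substr($0, i - 1, 1)
            # uppercase the first letter after a space, "_" or "/"
            if (prev == "" || prev == " " || prev == "_" || prev == "/")
                c = toupper(c)
            out = out c
        }
        gsub(/ And /, " and ", out)   # sample conjunction rule
        gsub(/_La_/, "_LA_", out)     # sample acronym rule
        gsub(/Dsc/, "DSC", out)       # sample camera-prefix rule
        gsub(/\.nef/, ".NEF", out)    # sample extension rule
        print out
    }'
}
```

Fed a whole list on stdin, this does per-file work entirely inside awk, with no subshells or extra processes per filename.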

  • You could stop creating so many processes, e.g. change `sed thingA | sed thingB` to `sed -e thingA -e thingB`. You could do all the `mv` commands by putting original and new names in a file and running the file through **Perl**, so each rename is just a library call, not a whole new process. – Mark Setchell Oct 09 '20 at 06:37
  • Indeed, use another language. Bash creates a process for each command. With Perl (one example) it will create only 1 process, and it will be much, much faster. You could also start a couple of instances, like a*, b*, ..., z*, to process many files at once. – Nic3500 Oct 09 '20 at 06:41
  • I suggest you focus on describing the rename rules in a clear way, for example what does `A ^a` mean? Currently what you present as the content of `capslist.txt` is not a well-communicated format for describing pattern replacements. Maybe also include a few representative filenames before and after the renaming. Also, you should use the `rename` command, as stated above, if it exists for your environment. Bash should not be used as a programming language, as in your examples, but as a high-level task manager. If you call ten processes to rename one file, it will be really slow. – thanasisp Oct 09 '20 at 07:29
  • You spawn a **minimum** of 17 subshells per function call, plus 3 more for every additional element in `"${capsarray[@]}"`. That will take *forever* for a million files. Rather than showing a partial `capslist.txt`, you need to explain (map out) the rules you are using to modify the filenames, and then you can probably write a single `awk` script that applies each rule to each filename in a single call to `awk`. That would be orders of magnitude faster and would likely handle a million files in about a minute. – David C. Rankin Oct 09 '20 at 07:44
  • @David C. Rankin Yes, it's very inefficient, hence this question. I added an example to make the rules clearer, but as you can see, I need to match a large number of things, such as countries, cities, etc. I didn't know that `awk` could do such a thing in one script. Would you mind giving me an example? – user122121 Oct 09 '20 at 16:21
  • Each rule in `capslist.txt` can be an `awk` rule, applied to every filename in a single call to `awk`. For example, `/A/{gsub(/A/, "a")}` will convert every `'A'` to `'a'` in a record. – David C. Rankin Oct 09 '20 at 19:49
