
I wrote a bash function that converts filenames to title case, then restores the capitalization of acronyms, camera prefixes like _DSC/_IMG, and so on. It also lowercases conjunctions.

The problem is that it is very slow. As a test, I ran it on 1000 files and it took 4m40s to finish, and I have over 1 million files to rename.

The function reads a list of sed replacements from a file and runs each of them in turn to fix the capitalization after the initial conversion.

Here is the function (I only have BSD/macOS bash available, so ${a,,} and sed 's/./\L&/g' do not work for me):

IFS=$'\r\n' GLOBIGNORE='*' command eval  'capsarray=($(cat capslist.txt))'
tc () {
capped="$(echo "$1" | tr '[:upper:]' '[:lower:]' | sed -e "s^_^_ ^g" -e "s^/^/ ^g" | awk '{for(i=1;i<=NF;i++){ $i=toupper(substr($i,1,1)) substr($i,2) }}1')"
for d in "${capsarray[@]}"; do
    capped="$(echo "$capped" | sed "s^$d^g")"
done
echo "$capped" | sed -e "s^_ ^_^g" -e "s^/ ^/^g"
}
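One quick win (a sketch, not the code above): compile capslist.txt into a sed script once, so each name pays for a single sed process covering all rules instead of one sed process per rule. The tc_fast function and the caps.sed filename are hypothetical names for illustration.

```shell
# Sketch: compile capslist.txt into a sed script once, then apply every
# rule with one sed process per call. Each "From^to" rule line becomes
# the sed command "s^From^to^g". tc_fast and caps.sed are made-up names.
tc_fast () {
    # build caps.sed on first use (skip blank lines)
    [ -s caps.sed ] || sed -e '/^$/d' -e 's/.*/s^&^g/' capslist.txt > caps.sed
    printf '%s\n' "$1" \
      | tr '[:upper:]' '[:lower:]' \
      | sed -e 's^_^_ ^g' -e 's^/^/ ^g' \
      | awk '{for(i=1;i<=NF;i++){ $i=toupper(substr($i,1,1)) substr($i,2) }}1' \
      | sed -f caps.sed \
      | sed -e 's^_ ^_^g' -e 's^/ ^/^g'
}
```

With N rules this replaces N sed processes per filename with one, while keeping the exact same rule file.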

and here is part of capslist.txt:

A ^a 
W ^w 
With ^with 
The ^the 
Is ^is 
Of ^of 
And ^and 
Or ^or 
But ^but 
To ^to 
In ^in 
By ^by 
Dsc^DSC
Img^IMG
_Mg^_MG
\(_[0-9][0-9]\)c^\1C
\(_[0-9][0-9]\)s^\1S
\(_[0-9][0-9]\)a^\1A
_Dji^_DJI
Dji_^DJI_
Img_^IMG_
_Kis\([0-9][0-9]\)^_KIS\1
Dk ^DK 
Uk ^UK 
Eu ^EU 

etc...

There are a lot of entries in this list, and I need to add even more for it to be complete.

Attempted solution to speed it up:

So I split the input file list into 16 arrays and run the function on each portion simultaneously. This makes it a lot faster, since it can now run on multiple cores.

Now I'm down to 1m30s on my 4-core (8 thread) machine.

But it's still not fast enough to do 1 million files.

Here is the glorious multicore bash beauty:

####### 16 THREADS:
declare -a array1
declare -a array2
declare -a array3
declare -a array4
declare -a array5
declare -a array6
declare -a array7
declare -a array8
declare -a array9
declare -a array10
declare -a array11
declare -a array12
declare -a array13
declare -a array14
declare -a array15
declare -a array16
total=${#inputfiles[@]}
div1="$(expr $total / 16)"
div2="$(expr $div1 + $div1)"
div3="$(expr $div1 + $div1 + $div1)"
div4="$(expr $div1 + $div1 + $div1 + $div1)"
div5="$(expr $div1 + $div1 + $div1 + $div1 + $div1)"
div6="$(expr $div1 + $div1 + $div1 + $div1 + $div1 + $div1)"
div7="$(expr $div1 + $div1 + $div1 + $div1 + $div1 + $div1 + $div1)"
div8="$(expr $div1 + $div1 + $div1 + $div1 + $div1 + $div1 + $div1 + $div1)"
div9="$(expr $div1 + $div1 + $div1 + $div1 + $div1 + $div1 + $div1 + $div1 + $div1)"
div10="$(expr $div1 + $div1 + $div1 + $div1 + $div1 + $div1 + $div1 + $div1 + $div1 + $div1)"
div11="$(expr $div1 + $div1 + $div1 + $div1 + $div1 + $div1 + $div1 + $div1 + $div1 + $div1 + $div1)"
div12="$(expr $div1 + $div1 + $div1 + $div1 + $div1 + $div1 + $div1 + $div1 + $div1 + $div1 + $div1 + $div1)"
div13="$(expr $div1 + $div1 + $div1 + $div1 + $div1 + $div1 + $div1 + $div1 + $div1 + $div1 + $div1 + $div1 + $div1)"
div14="$(expr $div1 + $div1 + $div1 + $div1 + $div1 + $div1 + $div1 + $div1 + $div1 + $div1 + $div1 + $div1 + $div1 + $div1)"
div15="$(expr $div1 + $div1 + $div1 + $div1 + $div1 + $div1 + $div1 + $div1 + $div1 + $div1 + $div1 + $div1 + $div1 + $div1 + $div1)"
array1=("${inputfiles[@]:0:$div1}")
array2=("${inputfiles[@]:$div1:$div1}")
array3=("${inputfiles[@]:$div2:$div1}")
array4=("${inputfiles[@]:$div3:$div1}")
array5=("${inputfiles[@]:$div4:$div1}")
array6=("${inputfiles[@]:$div5:$div1}")
array7=("${inputfiles[@]:$div6:$div1}")
array8=("${inputfiles[@]:$div7:$div1}")
array9=("${inputfiles[@]:$div8:$div1}")
array10=("${inputfiles[@]:$div9:$div1}")
array11=("${inputfiles[@]:$div10:$div1}")
array12=("${inputfiles[@]:$div11:$div1}")
array13=("${inputfiles[@]:$div12:$div1}")
array14=("${inputfiles[@]:$div13:$div1}")
array15=("${inputfiles[@]:$div14:$div1}")
array16=("${inputfiles[@]:$div15}")
function tcmv {
    b="$(basename "$d")"
    pathtofile="$(dirname "$d")"
    mv "$d" "$pathtofile"/"$(tc "$b")"
}
for d in "${array1[@]}"; do tcmv; done &
for d in "${array2[@]}"; do tcmv; done &
for d in "${array3[@]}"; do tcmv; done &
for d in "${array4[@]}"; do tcmv; done &
for d in "${array5[@]}"; do tcmv; done &
for d in "${array6[@]}"; do tcmv; done &
for d in "${array7[@]}"; do tcmv; done &
for d in "${array8[@]}"; do tcmv; done &
for d in "${array9[@]}"; do tcmv; done &
for d in "${array10[@]}"; do tcmv; done &
for d in "${array11[@]}"; do tcmv; done &
for d in "${array12[@]}"; do tcmv; done &
for d in "${array13[@]}"; do tcmv; done &
for d in "${array14[@]}"; do tcmv; done &
for d in "${array15[@]}"; do tcmv; done &
for d in "${array16[@]}"; do tcmv; done &
wait
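For comparison, the same fan-out can be done without hand-splitting into 16 arrays: BSD xargs supports -P for parallel workers. This is a generic sketch with throwaway demo files and a trivial placeholder "worker" (append .done), not the real tc + mv logic.

```shell
# Sketch: xargs -P fans the file list out over 4 parallel workers.
# demo/ and the ".done" rename are placeholders for the real logic.
mkdir -p demo && touch demo/A demo/B demo/C
find demo -type f -print0 \
  | xargs -0 -n 1 -P 4 sh -c 'mv "$0" "$0.done"'
```

The -print0 / -0 pair keeps filenames with spaces intact, and -P scales to any worker count without duplicating loops.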

EDIT: Example conversions with the custom case called out:

20200908_LA_HOLLYWOOD AND HIGHLAND_BUSY STREET_TIME LAPSE_DSC0795.NEF
20200908_LA_Hollywood and Highland_Busy Street_Time Lapse_DSC0795.NEF
         ^^           ^^^                                 ^^^     ^^^

20180706_STADIUM_SFO VS NYC_BASKETBALL MATCH_04B8897_OK TO PUBLISH.JPG
20180706_Stadium_SFO vs NYC_Basketball Match_04B8897_OK to Publish.JPG
                 ^^^ ^^ ^^^                   ^      ^^ ^^         ^^^

So my question is:

Is there a way to make this faster? And how?

EDIT2:

Based on the suggestions in the comments, I was able to come up with something a lot better:

IFS=$'\r\n' GLOBIGNORE='*' command eval  'capsarray=($(cat capslist.txt))'
tr '[:upper:]' '[:lower:]' < tmplist | sed -e "s^_^_ ^g" -e "s^/^/ ^g" | awk '{for(i=1;i<=NF;i++){ $i=toupper(substr($i,1,1)) substr($i,2) }}1' > tmplistB
tr '[:upper:]' '[:lower:]' < tmplist2 | sed -e "s^_^_ ^g" -e "s^/^/ ^g" | awk '{for(i=1;i<=NF;i++){ $i=toupper(substr($i,1,1)) substr($i,2) }}1' > tmplist2B
sedargs=()
for d in "${capsarray[@]}"; do
    sedargs+=(-e "s^$d^g")    # keep each rule (spaces included) as a single argument
done
sed "${sedargs[@]}" tmplistB > tmplistC
sed "${sedargs[@]}" tmplist2B > tmplist2C
sed -e "s^_ ^_^g" -e "s^/ ^/^g" tmplistC > tmplistD
sed -e "s^_ ^_^g" -e "s^/ ^/^g" tmplist2C > tmplist2D
IFS=$'\r\n' GLOBIGNORE='*' command eval  'renamedarray=($(cat tmplistD))'
IFS=$'\r\n' GLOBIGNORE='*' command eval  'renameddirarray=($(cat tmplist2D))'

i=0    # myarray holds the original paths, in the same order as renamedarray
for d in "${renamedarray[@]}"; do
    b="$(basename "$d")"
    pathtofile="$(dirname "${myarray[$i]}")"
    mv "${myarray[$i]}" "$pathtofile/$b"
    i=$((i+1))
done

This now does all the hard work on text files instead of the real files, and it uses a single sed with chained -e expressions (thanks @Mark-Setchell and @thanasisp).

Way faster. Now it's down to 15 seconds for 1000 files, without having to split the list across 16 threads. At this rate it would only take a little over 4 hours to rename a million files!

(it also renames files first and folders second, to avoid renaming a folder before the files inside it have been renamed)
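An alternative way to get that ordering (a generic sketch, not the script above): sort all paths by depth, deepest first, so no directory is ever renamed before its contents.

```shell
# Sketch: deepest paths first. awk prefixes each path with its component
# count; sort -rn puts deeper paths first; cut strips the count again.
printf '%s\n' 'a/b/c/file' 'a/b' 'a' \
  | awk -F/ '{ print NF, $0 }' \
  | sort -rn \
  | cut -d' ' -f2-
```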

So, as far as I'm concerned, this is good enough for what I need to do, but I didn't think I should answer my own question with the above, as I feel it could be even more efficient. For example, with the perl and awk solutions suggested in the comments (I'm still trying to understand those... keep googling).
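To make the awk idea from the comments concrete, here is a hedged sketch of the whole conversion as a single awk process: lowercase everything, title-case at word boundaries, then one gsub per exception rule. The four gsub rules shown are a tiny illustrative subset standing in for capslist.txt, and tc_awk is a hypothetical name.

```shell
# Sketch: the entire conversion in one tr + one awk process per list.
# The gsub exception rules below are a small sample; the real
# capslist.txt would become one gsub line each.
tc_awk () {
    tr '[:upper:]' '[:lower:]' | awk '
    {
        out = ""
        for (i = 1; i <= length($0); i++) {
            c = substr($0, i, 1)
            prev = (i == 1) ? "" : substr($0, i - 1, 1)
            # uppercase the first letter after a space, "_" or "/"
            if (prev == "" || prev == " " || prev == "_" || prev == "/")
                c = toupper(c)
            out = out c
        }
        gsub(/ And /, " and ", out)   # sample conjunction rule
        gsub(/_La_/, "_LA_", out)     # sample acronym rule
        gsub(/Dsc/, "DSC", out)       # sample camera-prefix rule
        gsub(/\.nef/, ".NEF", out)    # sample extension rule
        print out
    }'
}
```

Fed a whole list on stdin, this does per-file work entirely inside awk, with no subshells or extra processes per filename.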

  • You could stop creating so many processes, e.g. change `sed thingA | sed thingB` to `sed -e thingA -e thingB`. You could do all the `mv` commands by putting original and new names in a file and running the file through **Perl**, so each rename is just a library call, not a whole new process. – Mark Setchell Oct 09 '20 at 06:37
  • Indeed, use another language. Bash creates a process for each command. With Perl (one example) it will create only 1 process, and it will be much, much faster. You could also start a couple of instances, like a*, b*, ..., z*, to process many files at once. – Nic3500 Oct 09 '20 at 06:41
  • I suggest you focus on describing the rename rules in a clear way, for example what does `A ^a` mean? Currently what you present as the content of `capslist.txt` is not a well-communicated format for describing pattern replacements. Maybe also include a few representative filenames before and after the renaming. Also, you should use the `rename` command, as stated above, if it exists for your environment. Bash should not be used as a programming language, as in your examples, but as a high-level task manager. If you call ten processes to rename one file, it will be really slow. – thanasisp Oct 09 '20 at 07:29
  • You spawn a **minimum** of 17 subshells per function call, plus 3 more for every additional element in `"${capsarray[@]}"`. That will take *forever* for a million files. Rather than showing a partial `capslist.txt`, you need to explain (map out) the rules you are using to modify the filenames, and then you can probably write a single `awk` script that applies each rule to each filename in a single call to `awk`. That would be orders of magnitude faster and would likely handle a million files in about a minute. – David C. Rankin Oct 09 '20 at 07:44
  • @David C. Rankin Yes, it's very inefficient, hence this question. I added an example to make the rules clearer, but as you can see, I need to match a large number of things, such as countries, cities, etc. I didn't know that `awk` could do such a thing in one script. Would you mind giving me an example? – user122121 Oct 09 '20 at 16:21
  • Each rule in `capslist.txt` can be an `awk` rule, applied to every filename in a single call to `awk`. For example, `/A/{gsub(/A/, "a")}` will convert every `'A'` to `'a'` in a record. – David C. Rankin Oct 09 '20 at 19:49
