I wrote a bash function that turns filenames into title case, then capitalizes acronyms, camera _DSC/_IMG prefixes, etc. It also makes conjunctions lowercase.
The problem is that it is very slow. I ran it on 1000 files as a test, and it took 4m40s to finish. And I have over 1 million files to rename.
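For scale, here is a rough extrapolation of that test run. It assumes the time grows linearly with the number of files, which is only an assumption; the real run may behave differently.

```shell
#!/usr/bin/env bash
# Measured: 1000 files in 4m40s. Target: 1,000,000 files.
secs_per_1000=$(( 4 * 60 + 40 ))                   # 280 seconds
total_secs=$(( secs_per_1000 * 1000000 / 1000 ))   # linear scaling (assumption)
echo "$(( total_secs / 3600 ))h $(( total_secs % 3600 / 60 ))m"   # prints "77h 46m"
```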
The function takes a bunch of sed replacements out of a file, and runs each of these to convert the needed caps/lowercase after the initial conversion.
Here is the function:
(I can only have BSD/OSX bash, so ${a,,} and sed 's/./\L&/g' do not work for me...)
IFS=$'\r\n' GLOBIGNORE='*' command eval 'capsarray=($(cat capslist.txt))'
tc () {
capped="$(echo "$1" | tr [[:upper:]] [[:lower:]] | sed "s^_^_ ^g" | sed "s^/^/ ^g" | awk '{for(i=1;i<=NF;i++){ $i=toupper(substr($i,1,1)) substr($i,2) }}1')"
for d in "${capsarray[@]}"; do
capped2="$(echo "$capped" | sed "s^$d^g")"
capped="$capped2"
done
capped3="$(echo "$capped" | sed "s^_ ^_^g" | sed "s^/ ^/^g")"
echo "$capped3"
}
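A likely reason for the slowness is that every call to tc starts several processes for the first pipeline, plus an echo and a sed for every rule in the list. As a minimal sketch, the initial title-casing step could be done by a single awk process reading many names at once, one per line. The function name titlecase_stream is hypothetical, and unlike the original pipeline this version does not collapse runs of whitespace.

```shell
#!/usr/bin/env bash
# titlecase_stream (hypothetical name): reads names on stdin, one per line,
# lowercases each and capitalizes the first letter after a space, "_" or "/".
titlecase_stream() {
  awk '{
    s = tolower($0); out = ""; up = 1          # up = 1: capitalize next char
    for (i = 1; i <= length(s); i++) {
      c = substr(s, i, 1)
      if (up) c = toupper(c)
      out = out c
      up = (c == " " || c == "_" || c == "/")  # word boundary characters
    }
    print out
  }'
}

printf '%s\n' 'BUSY STREET_TIME LAPSE' | titlecase_stream
```

Because awk handles the whole list in one process, the cost per name no longer includes starting tr, sed and awk each time.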
And here is part of capslist.txt:
A ^a
W ^w
With ^with
The ^the
Is ^is
Of ^of
And ^and
Or ^or
But ^but
To ^to
In ^in
By ^by
Dsc^DSC
Img^IMG
_Mg^_MG
\(_[0-9][0-9]\)c^\1C
\(_[0-9][0-9]\)s^\1S
\(_[0-9][0-9]\)a^\1A
_Dji^_DJI
Dji_^DJI_
Img_^IMG_
_Kis\([0-9][0-9]\)^_KIS\1
Dk ^DK
Uk ^UK
Eu ^EU
etc...
There are a lot of entries in this list, and I need to add even more for it to be complete.
Attempted solution to speed it up:
So I split the input file list into 16 arrays and run the function on each portion simultaneously. This makes it a lot faster, since it can now run on multiple cores.
Now I'm down to 1m30s on my 4-core (8 thread) machine.
But it's still not fast enough to do 1 million files.
Here is the glorious multicore bash beauty:
####### 16 THREADS:
declare -a array1
declare -a array2
declare -a array3
declare -a array4
declare -a array5
declare -a array6
declare -a array7
declare -a array8
declare -a array9
declare -a array10
declare -a array11
declare -a array12
declare -a array13
declare -a array14
declare -a array15
declare -a array16
total=${#inputfiles[@]}
div1="$(expr $total / 16)"
div2="$(expr $div1 + $div1)"
div3="$(expr $div1 + $div1 + $div1)"
div4="$(expr $div1 + $div1 + $div1 + $div1)"
div5="$(expr $div1 + $div1 + $div1 + $div1 + $div1)"
div6="$(expr $div1 + $div1 + $div1 + $div1 + $div1 + $div1)"
div7="$(expr $div1 + $div1 + $div1 + $div1 + $div1 + $div1 + $div1)"
div8="$(expr $div1 + $div1 + $div1 + $div1 + $div1 + $div1 + $div1 + $div1)"
div9="$(expr $div1 + $div1 + $div1 + $div1 + $div1 + $div1 + $div1 + $div1 + $div1)"
div10="$(expr $div1 + $div1 + $div1 + $div1 + $div1 + $div1 + $div1 + $div1 + $div1 + $div1)"
div11="$(expr $div1 + $div1 + $div1 + $div1 + $div1 + $div1 + $div1 + $div1 + $div1 + $div1 + $div1)"
div12="$(expr $div1 + $div1 + $div1 + $div1 + $div1 + $div1 + $div1 + $div1 + $div1 + $div1 + $div1 + $div1)"
div13="$(expr $div1 + $div1 + $div1 + $div1 + $div1 + $div1 + $div1 + $div1 + $div1 + $div1 + $div1 + $div1 + $div1)"
div14="$(expr $div1 + $div1 + $div1 + $div1 + $div1 + $div1 + $div1 + $div1 + $div1 + $div1 + $div1 + $div1 + $div1 + $div1)"
div15="$(expr $div1 + $div1 + $div1 + $div1 + $div1 + $div1 + $div1 + $div1 + $div1 + $div1 + $div1 + $div1 + $div1 + $div1 + $div1)"
array1=("${inputfiles[@]:0:$div1}")
array2=("${inputfiles[@]:$div1:$div1}")
array3=("${inputfiles[@]:$div2:$div1}")
array4=("${inputfiles[@]:$div3:$div1}")
array5=("${inputfiles[@]:$div4:$div1}")
array6=("${inputfiles[@]:$div5:$div1}")
array7=("${inputfiles[@]:$div6:$div1}")
array8=("${inputfiles[@]:$div7:$div1}")
array9=("${inputfiles[@]:$div8:$div1}")
array10=("${inputfiles[@]:$div9:$div1}")
array11=("${inputfiles[@]:$div10:$div1}")
array12=("${inputfiles[@]:$div11:$div1}")
array13=("${inputfiles[@]:$div12:$div1}")
array14=("${inputfiles[@]:$div13:$div1}")
array15=("${inputfiles[@]:$div14:$div1}")
array16=("${inputfiles[@]:$div15}")
function tcmv {
b="$(basename "$d")"
pathtofile="$(dirname "$d")"
mv "$d" "$pathtofile"/"$(tc "$b")"
}
for d in "${array1[@]}"; do tcmv; done &
for d in "${array2[@]}"; do tcmv; done &
for d in "${array3[@]}"; do tcmv; done &
for d in "${array4[@]}"; do tcmv; done &
for d in "${array5[@]}"; do tcmv; done &
for d in "${array6[@]}"; do tcmv; done &
for d in "${array7[@]}"; do tcmv; done &
for d in "${array8[@]}"; do tcmv; done &
for d in "${array9[@]}"; do tcmv; done &
for d in "${array10[@]}"; do tcmv; done &
for d in "${array11[@]}"; do tcmv; done &
for d in "${array12[@]}"; do tcmv; done &
for d in "${array13[@]}"; do tcmv; done &
for d in "${array14[@]}"; do tcmv; done &
for d in "${array15[@]}"; do tcmv; done &
for d in "${array16[@]}"; do tcmv; done &
wait
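The same split can be written as a loop over array slices instead of 16 hand-written arrays. This is only a sketch: the sample inputfiles, the stub tcmv and the names jobs_n, chunk and outfile are stand-ins so it runs on its own. It rounds the chunk size up rather than putting the remainder in the last array.

```shell
#!/usr/bin/env bash
# Hypothetical stand-ins; the real script already has inputfiles and tcmv.
inputfiles=(a b c d e f g h i j)
outfile="$(mktemp)"
tcmv() { printf '%s\n' "$d" >> "$outfile"; }

jobs_n=4
total=${#inputfiles[@]}
chunk=$(( (total + jobs_n - 1) / jobs_n ))   # round up so no file is dropped

for (( start = 0; start < total; start += chunk )); do
  # Each slice runs in its own background subshell, as in the original.
  for d in "${inputfiles[@]:start:chunk}"; do tcmv; done &
done
wait

processed=$(wc -l < "$outfile" | tr -d ' ')
echo "processed $processed of $total"
```

Changing jobs_n is then enough to try a different number of parallel workers.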
EDIT: Example conversions with the custom case called out:
20200908_LA_HOLLYWOOD AND HIGHLAND_BUSY STREET_TIME LAPSE_DSC0795.NEF
20200908_LA_Hollywood and Highland_Busy Street_Time Lapse_DSC0795.NEF
         ^^           ^^^                                 ^^^     ^^^
20180706_STADIUM_SFO VS NYC_BASKETBALL MATCH_04B8897_OK TO PUBLISH.JPG
20180706_Stadium_SFO vs NYC_Basketball Match_04B8897_OK to Publish.JPG
                 ^^^ ^^ ^^^                    ^     ^^ ^^         ^^^
So my question is:
Is there a way to make this faster? And how?
EDIT2:
Based on the suggestions in the comments, I was able to come up with something a lot better:
IFS=$'\r\n' GLOBIGNORE='*' command eval 'capsarray=($(cat capslist.txt))'
cat tmplist | tr [[:upper:]] [[:lower:]] | sed "s^_^_ ^g" | sed "s^/^/ ^g" | awk '{for(i=1;i<=NF;i++){ $i=toupper(substr($i,1,1)) substr($i,2) }}1' > tmplistB
cat tmplist2 | tr [[:upper:]] [[:lower:]] | sed "s^_^_ ^g" | sed "s^/^/ ^g" | awk '{for(i=1;i<=NF;i++){ $i=toupper(substr($i,1,1)) substr($i,2) }}1' > tmplist2B
commandvar="sed"
commandvar2="sed"
for d in "${capsarray[@]}"; do
commandvar+=" -e \"s^$d^g\""
commandvar2+=" -e \"s^$d^g\""
done
commandvar+=' > tmplistC'
commandvar2+=' > tmplist2C'
cat tmplistB | eval $commandvar
cat tmplist2B | eval $commandvar2
cat tmplistC | sed "s^_ ^_^g" | sed "s^/ ^/^g" > tmplistD
cat tmplist2C | sed "s^_ ^_^g" | sed "s^/ ^/^g" > tmplist2D
IFS=$'\r\n' GLOBIGNORE='*' command eval 'renamedarray=($(cat tmplistD))'
IFS=$'\r\n' GLOBIGNORE='*' command eval 'renameddirarray=($(cat tmplist2D))'
i=0
for d in "${renamedarray[@]}"; do
b="$(basename "$d")"
pathtofile="$(dirname "${myarray[$i]}")"
mv "${myarray[$i]}" "$pathtofile/$b"
i=$((i+1))
done
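As an alternative to loading both lists into arrays, the old and new names can be read in lockstep from two files. This is a sketch under assumptions: old.txt and new.txt are hypothetical files holding full paths, one per line and in the same order, and no filename contains a newline.

```shell
#!/usr/bin/env bash
# Demo data in a temporary directory (hypothetical names).
workdir="$(mktemp -d)"
touch "$workdir/HOLLYWOOD AND HIGHLAND.NEF" "$workdir/keep.txt"
printf '%s\n' "$workdir/HOLLYWOOD AND HIGHLAND.NEF" "$workdir/keep.txt" > "$workdir/old.txt"
printf '%s\n' "$workdir/Hollywood and Highland.NEF" "$workdir/keep.txt" > "$workdir/new.txt"

# Read both lists line by line on separate file descriptors (3 and 4).
while IFS= read -r old <&3 && IFS= read -r new <&4; do
  [ "$old" = "$new" ] && continue      # name unchanged, nothing to rename
  mv -- "$old" "$new"
done 3< "$workdir/old.txt" 4< "$workdir/new.txt"

ls "$workdir"
```

Skipping unchanged names also saves one mv per file that is already in the right case.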
This now does all the hard work on text files instead of the real files, and runs a single sed with one -e expression per rule (thanks @Mark-Setchell and @thanasisp).
Way faster. Now it's down to 15 seconds for 1000 files, without having to split the array into 16 threads. It would only take me a little over 4 hours to rename a million files!
(It also does files first, and then folders, so that a folder is not renamed before the files inside it.)
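The rules could also be turned into a sed script file and applied with sed -f, which avoids building a command string for eval. This is a sketch, not a tested replacement for the code above: caps.sed, names.txt and renamed.txt are hypothetical file names, and the three demo rules are written with their trailing spaces made explicit.

```shell
#!/usr/bin/env bash
# Demo inputs (hypothetical); the real run would use capslist.txt and tmplistB.
workdir="$(mktemp -d)"
printf '%s\n' 'And ^and ' 'Dsc^DSC' '\(_[0-9][0-9]\)c^\1C' > "$workdir/capslist.txt"
printf '%s\n' 'Hollywood And Highland_Dsc0795.nef' 'Clip_01c.mov' > "$workdir/names.txt"

# Turn each rule "pattern^replacement" into the sed command "s^pattern^replacement^g".
# tr drops carriage returns in case the list has CRLF line endings.
tr -d '\r' < "$workdir/capslist.txt" |
  sed -e '/^$/d' -e 's/.*/s^&^g/' > "$workdir/caps.sed"

# One sed process applies every rule to every name, with no eval involved.
sed -f "$workdir/caps.sed" "$workdir/names.txt" > "$workdir/renamed.txt"
cat "$workdir/renamed.txt"
```

One caveat with this approach is that a rule starting with # would be read as a comment in the script file.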
So, as far as I'm concerned, this is good enough for what I need to do, but I didn't think I should answer my own question with the above, as I feel it could be even more efficient, for example with the perl and awk solutions. (However, I'm still trying to understand these... I'll keep googling.)