I am trying to check how many different images exist in a folder that may contain the same image under different names. For that, I am using their MD5 sums to see whether two images are the same.

I do not know if there is a faster way of achieving the same result, but I am more interested in understanding why there is such a big difference in performance when I execute the same code several times in a row.

I read this really good post about the time command but could not draw any conclusion from it.

$ time md5 -q * | sort | uniq | wc -l
    1184

real    1m7.923s
user    0m1.408s
sys     0m0.796s

$ time md5 -q * | sort | uniq | wc -l
    1184

real    0m11.220s
user    0m1.345s
sys     0m0.686s

$ time md5 -q * | sort | uniq | wc -l
    1184

real    0m9.011s
user    0m1.321s
sys     0m0.595s

$ time md5 -q * | sort | uniq | wc -l
    1184

real    0m1.644s
user    0m1.257s
sys     0m0.386s

$ time md5 -q * | sort | uniq | wc -l
    1184

real    0m2.213s
user    0m1.267s
sys     0m0.408s

$ time md5 -q * | sort | uniq | wc -l
    1184

real    0m1.541s
user    0m1.253s
sys     0m0.380s

$ time md5 -q * | sort -u | wc -l
    1184

real    0m1.551s
user    0m1.253s
sys     0m0.387s

$ time md5 -q * | sort -u | wc -l
    1184

real    0m1.553s
user    0m1.255s
sys     0m0.388s

# Here I waited for 5 minutes.

$ time md5 -q * | sort -u | wc -l
    1184

real    0m12.028s
user    0m1.352s
sys     0m0.720s

Is the variability in real time due to execution priority? Should I just look at the user time? Waiting one minute (real time) for a task that can be completed in just one second is really annoying...

FYI: I am executing the previous commands on a macOS High Sierra machine.

Miguel Isla
  • The first time, it has to read all the files into memory. The other times the files are still in memory, so it's much faster. – Barmar Apr 26 '18 at 08:00
  • Possible answers to your question: https://superuser.com/a/638954 and https://unix.stackexchange.com/a/40207/281661. As Barmar pointed out, the reason the second command is faster is file caching. – builder-7000 Apr 26 '18 at 08:13
  • Mmm, interesting. This may explain the difference between the first and second commands, but between the second and third? Or between the third and fourth? More things being cached? – Miguel Isla Apr 26 '18 at 08:22
  • Do not expect commands to have super close run times. Your system always does other things which can affect performance (context switches, other programs, disk accesses, library accesses, ...). Even more so if you are on a virtual system. time gives you a ballpark figure, but do not read too much into differences of a few seconds. – Nic3500 Apr 26 '18 at 11:59
  • Thanks a lot for your responses. If any of you want to publish an answer, I will accept it. – Miguel Isla Apr 26 '18 at 15:17

1 Answer

What happens when you enter the command for the first time is that all the files have to be read from disk into memory. The user-mode time is approximately the same as in all the other runs, because that is where your MD5 sums get calculated, while the system-mode time is a bit higher, because that is where the disk access gets handled. Thanks to DMA, however, the CPU is free to do other useful things while the data is transferred from disk, so most of that waiting shows up neither in user nor in sys, only in real. That is why the real time is so much higher than the user and sys times in the first run.
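If you want to verify this yourself, macOS comes with a purge utility that flushes the filesystem cache, so a run right after it should be slow again, like your very first one. A minimal sketch (purge may require sudo depending on the macOS version):

$ sudo purge
$ time md5 -q * | sort -u | wc -l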

As various comments have already pointed out, the subsequent runs are much faster because most of the data is already in memory: you spend less time in system mode issuing calls to the disk, and you wait much less, because the disk is barely touched anymore. The hash calculation itself takes the same time as before, since it is hardly influenced by these factors, which is why you see a very similar user-mode time. Overall, the real time therefore comes much closer to the sum of the user and sys times.
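You can provoke the warm-cache case deliberately by reading the files once before measuring; the real time should then land close to the sum of user and sys right away. A small sketch:

# Pull the files into the page cache first, untimed.
$ cat * > /dev/null
$ time md5 -q * | sort -u | wc -l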

Now, to the fluctuations in your subsequent calls: parallelism is an illusion most of the time. Your computer appears to do far more things "in parallel" than it could truly run in parallel on its few physical cores. Instead, it switches between tasks extremely fast, always doing just a bit of the work, so the user gets the impression that everything runs simultaneously. Between and during your calls, the machine may have been doing other things that use system resources as well. That can cause parts of your cached data to be evicted, so they have to be loaded from disk again on the next call, but only partly. This multiplexing of all pending jobs is also what produces the sub-second fluctuations that are hard to explain: it is just interference from everything else running "in parallel".
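A practical consequence is that a single measurement is mostly noise. Taking several samples and looking at the smallest real value gives a more stable best-case figure; a sketch of such a loop (the subshell is needed so that grep can see the output of the time keyword):

$ for i in 1 2 3 4 5; do
>     ( time md5 -q * | sort -u | wc -l ) 2>&1 | grep real
> done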

But overall, don't worry: the first time you run the command, your machine really does need that minute. It's just that either your disk is really slow, or its use is shared with other processes, or the amount of data to be read is really high, or, most likely, a mixture of all of these!
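As for your side question about a faster way: since the expensive part is reading the bytes, one common trick is to hash only the files whose size occurs more than once, because a file with a unique size cannot be a duplicate of anything. A rough sketch using the BSD stat that ships with macOS (it assumes filenames without newlines, and /tmp/sizes is just a scratch file made up for the example):

$ stat -f '%z %N' * > /tmp/sizes

# Files with a unique size are distinct images by definition.
$ uniq_sized=$(awk '{c[$1]++} END {n = 0; for (s in c) if (c[s] == 1) n++; print n}' /tmp/sizes)

# Hash only the files whose size occurs more than once.
$ dup_hashes=$(awk 'NR == FNR {c[$1]++; next} c[$1] > 1 {sub(/^[0-9]+ /, ""); print}' \
>     /tmp/sizes /tmp/sizes | while IFS= read -r f; do md5 -q "$f"; done | sort -u | wc -l)

$ echo $((uniq_sized + dup_hashes))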

Wanderer