Do you conclude that your optimizations were successful?
No. 3 runs is not enough especially due to the huge variation and the fact that some timings of the two groups are mixed once merged and sorted.
For small timings like this, the first run should be removed and at least dozens of runs should be performed. I would personally use at least hundreds of runs.
Do software engineers actually use these formal calculations in practice?
Only very few developers does advanced statistical analysis. It is often not needed to do something very formal when the gab before/after the target optimization is huge and the variation within groups is small.
For example, if your program is twice faster than before with a min-max variation of <5%, then you can quite safely say that the optimization is successful. That being said, it is sometimes not the case due to unexpected external factors (though it is very rare when the gap is so big).
If the result is not obvious, then you need to do some statistic basics. You need to compute the standard deviation, the mean and median time, remove the first run, interleave runs and use many runs (at least dozens). The distribution of the timings almost always follow a normal distribution due to the central limit theorem. It is sometimes a mixture distribution due to the threshold effects (eg. caching). You can plot the value to see that easily if you see some outliers in timings.
If there are threshold effects, then you need to apply an advanced statistical analysis but this is complex to do and generally it is not an expected behaviour. I is generally a sign that the benchmark is biased, there is a bug or a complex effect you have to consider during the analysis of the result anyway. Thus, I strongly advise you to fix/mitigate the problem before analysing the results in that case.
Assuming the timings follow a normal distribution, you can just check if the median is close to the mean and if the standard deviation is small compare to the gap between the mean.
A more formal way to do that is to compute the Student t-test and its associated p-value and check the significance of the p-value (eg. <5%). If there are more groups, An Anova can be used. If you are unsure about the distribution, you can apply non-parametric statistical tests like the Wilcoxon and Kruskal-Wallis tests (note that the statistical power of these test is not the same). In practice, doing such a formal analysis is time-consuming and it is generally not so useful compare to a naive basic check (using the mean and standard deviation) unless your modification impacts a lot of users or you plan to write research papers.
Keep in mind that using a good statistical analysis does not prevent biased benchmarks. You need to minimize the external factors that can cause biased results. One frequent bias is frequency scaling: the first benchmark can be faster than the second because of turbo-boost or it can be slower because the processor can take some time to reach a high frequency. Caches also plays a huge role in benchmark biases. There are many other factors that can cause biases in practice like the compiler/runtime versions, environment variables, configuration files, OS/driver updates, memory alignment, OS paging (especially on NUMA systems), the hardware (eg. thermal throttling), software bugs (it is not rare to find bugs by analysing strange performance behaviours), etc.
As a result, it is critical to make benchmarks as reproducible as possible (by fixing versions and reporting the environment parameters (as well as possibly run the benchmarks in a sandbox if you are paranoid and if it does not affect too much the timings). Software like Nix/Spack help for packaging, and containers like LXD, Docker could help for a more reproducible environment.
Many big software team use automated benchmarking to check the presence of performance regression. Tools can do the run properly and statistical analysis for you regularly. A good example is the Numpy team which use a package called Airspeed Velocity (see the results). The PyPy team also designed their own benchmarking tool. The Linux kernel also have benchmarking suite to check for regression (eg. PTS) and many company focusing on performance have such automated benchmarking tools (often home-made). There are many existing tools for that.
For more information about this topic, please give a look to the great Performance Matters presentation by Emery Berger.