1) where most of the time is lost executing an algorithm
Use a profiler to find hot spots. It's not even worth looking at the compiler's asm output for code that isn't part of an important loop.
2) whether writing inline assembly will really enhance execution speed
Look at the compiler's asm output and see whether it's doing something stupid that you could do better. This requires knowing the microarchitecture you're targeting, so you know what's slow vs. fast. If you're targeting x86, see the x86 tag wiki for perf guides (e.g. Agner Fog's optimizing assembly guide, microarchitecture guide, and instruction tables, as well as Intel's optimization manual).
As @chqrlie points out, any hand-written asm will also be tuned for some specific microarchitecture, and may not be optimal on future CPUs. Out-of-order execution often hides instruction-ordering issues, but not all ARM CPUs are out-of-order, so on in-order cores instruction scheduling still matters.
Your first attempt should be to tweak the C to guide the compiler into a smarter way of implementing the same logic, like I did in this answer.
If the problem is vectorizable, but the compiler doesn't auto-vectorize it, your first course of action should be to manually vectorize it with intrinsics, not with inline-asm. Compilers can do a good job optimizing code that uses intrinsics.
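As a rough illustration of manual vectorization with intrinsics, here is a minimal sketch of a float sum. Compilers usually won't auto-vectorize this loop without something like -ffast-math, because FP addition isn't associative and vectorizing changes the order of the additions. The function name sum_floats is made up for this example, and I'm assuming x86 with SSE for the vector path; other targets fall through to the plain scalar loop.

```c
#include <stddef.h>
#if defined(__SSE__)
#include <xmmintrin.h>   /* SSE intrinsics: __m128, _mm_add_ps, ... */
#endif

/* Hypothetical example: sum n floats, 4 at a time where SSE is available.
 * The result can differ in rounding from a strict left-to-right scalar sum,
 * because the additions happen in a different order. */
float sum_floats(const float *a, size_t n)
{
    size_t i = 0;
    float total = 0.0f;
#if defined(__SSE__)
    __m128 acc = _mm_setzero_ps();             /* four partial sums */
    for (; i + 4 <= n; i += 4)
        acc = _mm_add_ps(acc, _mm_loadu_ps(a + i));   /* unaligned load is fine */
    float lanes[4];
    _mm_storeu_ps(lanes, acc);                 /* horizontal sum of the 4 lanes */
    total = (lanes[0] + lanes[1]) + (lanes[2] + lanes[3]);
#endif
    for (; i < n; i++)                         /* leftover elements, or everything without SSE */
        total += a[i];
    return total;
}
```

The compiler is still free to unroll this, pick registers, and inline it into callers, which is exactly what you give up with inline asm.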
Writing inline asm (or a whole function in asm that you call from C) should be a last resort. Besides the portability and maintainability problems, inline asm defeats compiler optimizations like constant-propagation. See https://gcc.gnu.org/wiki/DontUseInlineAsm.
If one of the inputs to your function is a compile-time-constant (after inlining and link-time optimization), a C implementation (with intrinsics) will simplify to the special case for that constant input.
But an inline-asm version won't simplify at all. The compiler will just MOV constant values into registers and run your asm as written. In GNU C, you can sometimes detect and avoid this by asking the compiler whether an input is a compile-time-constant, e.g. if(__builtin_constant_p(some_var)) { C implementation } else { asm(...); }. Unfortunately, clang (at least at the time of writing) doesn't propagate compile-time-constantness through function inlining, so it's always false for function args :(
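To make the __builtin_constant_p pattern concrete, here is a minimal sketch using a 32-bit rotate. The function name rotl32 is hypothetical, the asm is x86-only GNU C in the default AT&T syntax, and a rotate is purely illustrative: compilers already recognize the C rotate idiom, so you would not actually want asm for this.

```c
#include <stdint.h>

/* Hypothetical example: rotate x left by n bits.
 * When both inputs are compile-time constants (possibly only after inlining),
 * take the pure C path so the compiler can fold the whole call to a constant.
 * Otherwise use the inline-asm version on x86. */
static inline uint32_t rotl32(uint32_t x, unsigned n)
{
    n &= 31;
#if defined(__GNUC__) && (defined(__i386__) || defined(__x86_64__))
    if (!(__builtin_constant_p(x) && __builtin_constant_p(n))) {
        /* "+r"(x): x is read and written in a register.
         * "c"(n):  the count must be in ecx, because ROL takes it in CL.
         * "cc":    ROL modifies flags. */
        __asm__("roll %%cl, %0" : "+r"(x) : "c"(n) : "cc");
        return x;
    }
#endif
    /* Portable C rotate; also the fallback on non-x86 or non-GNU compilers. */
    return (x << n) | (x >> ((32 - n) & 31));
}
```

With optimization enabled, a call like rotl32(0x80000001u, 1) should go down the C path and fold to a constant, while a call with runtime inputs runs the asm. Without optimization, __builtin_constant_p is typically false for anything but literals, so the asm path is taken; both paths return the same value.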
And finally, if you think you can beat the compiler, make sure you actually succeeded by running a benchmark once you're done, against the best C implementation you can come up with.
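A very rough timing harness for that comparison might look like the sketch below. All names here (kernel_fn, kernel_c, time_kernel) are made up for illustration, and kernel_c is just a stand-in for your best C implementation. It uses clock() from standard C, which measures CPU time with coarse resolution, so use a large repeat count and run it several times.

```c
#include <stdint.h>
#include <time.h>

/* Hypothetical kernel signature: both the C and the asm version would match it. */
typedef uint32_t (*kernel_fn)(uint32_t);

/* Stand-in for the best C implementation you can come up with. */
uint32_t kernel_c(uint32_t x)
{
    return x * 2654435761u + 1u;
}

/* Call fn reps times, feeding each result into the next call.
 * The dependency chain stops the compiler from hoisting or dropping calls,
 * but it also means this measures latency, not throughput.
 * The final value is stored through sink so the work is observable. */
double time_kernel(kernel_fn fn, uint32_t seed, long reps, uint32_t *sink)
{
    uint32_t x = seed;
    clock_t start = clock();
    for (long i = 0; i < reps; i++)
        x = fn(x);
    clock_t end = clock();
    *sink = x;
    return (double)(end - start) / CLOCKS_PER_SEC;
}
```

You would call time_kernel once with each implementation, check that both leave the same value in sink, and compare the times. Keep in mind that a microbenchmark like this runs with warm caches and well-trained branch predictors, so also measure inside the real program if you can.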