No, multiply is much more complicated than XOR, ADD, OR, NOT, etc. Binary makes it much easier than base 10, but you still need a much larger adder than the one behind a simple two-operand ADD or other operation.
Take the bits abcd and multiply them by 1011:

        abcd
      * 1011
    ========
        abcd
       abcd.
      0000..
    +abcd...
    ========
In base 10, like in grade school, you had to multiply each row by a digit. You are still multiplying here, but only by one or zero, so you either copy and shift the first operand or copy and shift zeros. It gets big quickly because the addition is cascaded. Look up the XOR gate on Wikipedia and see the full adder, or just google it. For a simple two-operand add you have a single-column adder per bit, with three inputs and two outputs, and the carry out of one bit is the carry in of the next. No logic is instantaneous; even a single transistor inversion (NOT) takes a non-zero amount of time. You can start to think about how many gates are lined up just to make one 32-bit two-operand ADD, and then think about a 32-bit multiply, where each adder handles 32 operand bits plus some number of carry bits, and all of those adders are cascaded. The chip real estate and the settling time grow much faster for multiply than for add (for a straightforward array of adders the gate count grows roughly with the square of the operand width), and you start to worry about whether you can meet timing, that is, whether the msbit of the result can settle within the desired/designed clock period.
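To make the cascading concrete, here is a rough software sketch of the same idea, not a description of how any real chip is built. The names `full_adder`, `ripple_add`, and `shift_add_multiply` are made up for illustration, and the "cell count" assumes the most naive possible layout: one full-width ripple-carry adder per bit of the second operand.

```python
def full_adder(a, b, carry_in):
    """One-bit full adder: three input bits, returns (sum_bit, carry_out)."""
    partial = a ^ b
    sum_bit = partial ^ carry_in
    carry_out = (a & b) | (partial & carry_in)
    return sum_bit, carry_out


def ripple_add(x, y, width):
    """Add two unsigned ints one column at a time.

    The carry out of each column feeds the carry in of the next.
    Returns (sum modulo 2**width, number of full-adder cells used).
    """
    carry = 0
    result = 0
    for i in range(width):
        s, carry = full_adder((x >> i) & 1, (y >> i) & 1, carry)
        result |= s << i
    return result, width


def shift_add_multiply(x, y, width):
    """Multiply two unsigned width-bit ints by summing shifted copies of x.

    Returns (product, total full-adder cells used in this naive model).
    """
    product = 0
    cells = 0
    for i in range(width):
        # Multiplying by one bit: either a shifted copy of x, or zeros.
        partial = (x << i) if (y >> i) & 1 else 0
        # The running sum needs up to 2*width bits to hold the product.
        product, used = ripple_add(product, partial, 2 * width)
        cells += used
    return product, cells


if __name__ == "__main__":
    print(ripple_add(0b1101, 0b1011, 4))          # 4-bit add, wraps at 16
    print(shift_add_multiply(0b1101, 0b1011, 4))  # abcd = 1101 times 1011
```

In this toy model a 4-bit add uses 4 full-adder cells, while the 4-bit multiply uses 4 adders of 8 cells each, 32 in total; at 32 bits that becomes 32 cells versus 32 * 64 = 2048. Real multipliers arrange their adders more cleverly than this, but the point stands that multiply is a lot more logic and a lot more settling time than a single ADD.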
So you will see optimizations, including multiple pipe stages: not 32 clocks to do a 32-bit multiply, but maybe not one clock either, maybe two or four. With a pipe a dozen stages deep, though, you can bury that in there and still meet an advertised average of one clock per instruction.
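A back-of-the-envelope way to see how a deep pipe hides multi-clock work is sketched below. It assumes an idealized pipeline that accepts one new instruction every clock and never stalls, which no real core manages; `total_cycles` and `average_cpi` are hypothetical helper names, not anything from a vendor document.

```python
def total_cycles(num_instructions, pipe_depth):
    """Clocks to finish a stream in an ideal pipeline with no stalls.

    The first instruction needs pipe_depth clocks to come out the far end;
    after that, one instruction completes every clock.
    """
    if num_instructions == 0:
        return 0
    return pipe_depth + num_instructions - 1


def average_cpi(num_instructions, pipe_depth):
    """Average clocks per instruction over the whole stream."""
    return total_cycles(num_instructions, pipe_depth) / num_instructions


if __name__ == "__main__":
    # A single instruction through a 12-stage pipe vs. a long stream.
    print(total_cycles(1, 12), average_cpi(1, 12))
    print(total_cycles(1000, 12), average_cpi(1000, 12))
```

One instruction on its own takes all 12 clocks, but 1000 back-to-back instructions take 1011 clocks, which averages out to just over one clock each even though none of them individually finished in one.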
With Intel, ARM, etc., the one-cycle thing is an illusion. The math operation itself might take that long, but executing the instruction takes a few to a handful of cycles, and pipe depths may be several to a dozen or more stages. There is limited use in attempting to count cycles these days. Feeding the pipe and handling memory operations tend to dominate performance, not the pipe/instructions themselves, outside of a carefully crafted sim of the core.
For the cortex-ms, which are perhaps not what you are asking about but are very much part of our daily lives, the documentation shows that the chip vendor can choose between a larger, faster multiply and a smaller, slower one that helps with overall chip size and perhaps performance. (I do not examine the cortex-a docs as much, since I do not use them as often.) This is a compile-time option when the vendor compiles the core, and there are many such options. That is why, for any ARM core, cortex-m or cortex-a, you cannot directly compare, say, two cortex-m4s from different vendors, or from different chip families within a vendor: they could have been compiled differently and behave/perform differently (they still execute the enabled instructions in the same functional way, of course).
So no, you cannot assume the "execution time" or "cycle time" of ANY instruction, and in particular instructions like multiply and divide and anything floating point cannot be assumed to be single cycle. As with all the other instructions, the advertised one cycle is based on pipeline effects: no instruction takes one cycle start to finish, and depending on the pipe depth of the design, multiply and divide may take more than one clock but be hidden by the pipe so that the average is still one clock per instruction.
Note that this question is "too broad", as there are many Intel and ARM implementations, past and present. Chip implementation details are often unavailable or protected by NDA; all you have, if anything, are public documents that can hide the reality.