It depends on the internal implementation. Consider the following naive binary long division in C:
struct div_t {
    int quot;
    int rem;
};

/* assumes divisor != 0 and dividend != 0 (__builtin_clz(0) and division by
   zero are undefined) and dividend != INT_MIN (negation would overflow) */
struct div_t div(int dividend, int divisor){
    _Bool dividend_is_negative = (dividend < 0),
          divisor_is_negative = (divisor < 0),
          result_is_negative = divisor_is_negative ^ dividend_is_negative;
    unsigned quotient = 0, shifted;
    int shift;
    //branchless: if (divisor_is_negative) divisor = -divisor;
    divisor ^= -divisor_is_negative;
    divisor += divisor_is_negative;
    //branchless: if (dividend_is_negative) dividend = -dividend;
    dividend ^= -dividend_is_negative;
    dividend += dividend_is_negative;
    shifted = divisor;
    //shift divisor so its MSB is same as dividend's - minimize loop passes
    //if no builtin clz, then shift divisor up until it covers the dividend,
    //such as: while (shifted < (unsigned)dividend) shifted <<= 1;
    shift = __builtin_clz(divisor) - __builtin_clz(dividend);
    //clamp shift to 0 (it must be signed here) so a negative shift count
    //never reaches <<, which would be undefined behavior
    shift &= -(shift > 0);
    shifted <<= shift;
    do{
        unsigned tmp;
        quotient <<= 1;
        tmp = (shifted <= (unsigned)dividend);
        quotient |= tmp;
        //if (tmp) dividend -= shifted;
        dividend -= shifted & -tmp;
        shifted >>= 1;
    }while (shifted >= (unsigned)divisor);
    //if (result_is_negative) quotient = -quotient;
    quotient ^= -result_is_negative;
    quotient += result_is_negative;
    //the remainder takes the dividend's sign, not the result's:
    //if (dividend_is_negative) dividend = -dividend;
    dividend ^= -dividend_is_negative;
    dividend += dividend_is_negative;
    return (struct div_t){quotient, dividend};
}
Obviously, for smaller dividends the loop terminates sooner, so the naive implementation's latency depends on the magnitudes of the operands.
If the whole loop were unrolled and parallelized in circuitry, it would use a lot of chip area, since each bit of the quotient depends on the higher bits. That cost is often worth it for desktops, servers, and supercomputers, but much less so for mobile, embedded, or IoT parts.
Many configurable chip designs (RISC-V, J-Core, etc.) offer fast vs. slow division options, while some leave hardware division out altogether and let the compiler emit something like my example.