x86-32 (aka IA-32) is a 32-bit extension to 16-bit 8086, which was designed for easy porting of asm source from 8-bit 8080 to 8086. (Why are first four x86 GPRs named in such unintuitive order? on retrocomputing).
This history is why modern x86 has so much partial-register stuff, with direct support for 8 and 16-bit operand-size.
Most other architectures with 32-bit registers only allow narrow loads/stores, with ALU operations being only full register width. (But that's as much because they're RISC architectures (MIPS, SPARC, and even the slightly less-RISCy ARM), while x86 is definitely a CISC architecture.)
The 64-bit extensions to RISC architectures like MIPS still support 32-bit operations, usually implicitly zero-extending 32-bit results into the "full" registers the same way x86-64 does. (Especially if 64-bit isn't a new mode, but rather just new opcodes within the same mode, with semantics designed so that existing machine code will run the same way when addressing modes use the full registers but all the legacy opcodes still only write the low 32 bits.)
So the situation you observe on x86-32 (with narrow operations on partial registers being supported) is present in all architectures that exist as a wider extension to an older architecture, whether it runs in a new mode (where machine code decodes differently) or not. It's just that x86 ancestry goes back to 16-bit within x86, and back to 8-bit as an influence on 8086.
Motorola 68000 has 32-bit registers, but according to Wikipedia "the main ALU" is only 16-bit. (Maybe some 32-bit operations are slower or microcoded, but 32-bit add/and instructions are definitely supported. I don't know the details behind why Wikipedia says that.)
Originally 68000 was designed to work with a 16-bit external bus, so 16-bit loads/stores were more efficient on those early CPUs. I think later 68k CPUs widened the data buses, making 32-bit load/store as fast as 16-bit. Anyway, I think m68k is another example of a 32-bit architecture that supports a lot of 16-bit operations. Wikipedia describes it as "a 16/32-bit CISC microprocessor".
With the addition of caches, twice as many 16-bit integers fit in a cache line as 32-bit integers, so for sequential access 16-bit only costs half as much average / sustained memory bandwidth. Talking about "bus width" gets more complicated when there's cache, so there's a bus or internal data path between the load/store units and cache, and between cache and memory. (And in a multi-level cache, between different levels of cache).
Deciding whether to call an architecture (or a specific implementation of that architecture) 8 / 16 / 32 / 64-bit is pretty arbitrary. The marketing department is likely to pick the widest thing they can justify, and use that in descriptions of the CPU. That might be a data bus or a register width, or address space or anything else. (Many 8-bit CPUs form 16-bit addresses by concatenating two 8-bit registers, although most of them don't try to claim to be 16-bit. They might be advertised as 8/16-bit, though.)
32-bit x86 is considered 32-bit because that's the maximum width of a pointer, or a "general purpose" integer register. 386 added several major new things: 32-bit integer registers / operand size (accessible with prefixes from real mode), and 32-bit protected mode with virtual memory paging, where the default address and operand sizes are 32 bits.
The physical CPUs that can run IA-32 machine code today have vastly wider buses and better memory bandwidth than the first generation 386SX CPUs, but they still support the same IA-32 architecture (plus extensions).
These days, essentially all new x86 CPUs can also run in x86-64 mode. When running in IA-32 mode, a modern x86 CPU will only be using the low 32 bits of its 64-bit physical integer registers (like for instructions that use 32-bit operand-size in 32-bit or 16-bit mode).
But besides the integer registers, there are the 80-bit x87 registers (which can be used as 64-bit integer-SIMD MMX registers), and also the XMM / YMM / ZMM registers (SSE / AVX / AVX512).
SSE2 is baseline for x86-64, and can be assumed in most 32-bit code these days, so at least 128-bit registers are available, and can be used for 64-bit integer add/sub/shift even in 32-bit mode with instructions like paddq.
Modern CPUs also have at least 128-bit connections between the vector load/store units and cache, so load/store/copy bandwidth when data fits in L1d cache is not limited by the external dual/triple/quad-channel DDR3/DDR4 DRAM controllers (which do burst transfers of 8x 64-bits = one 64-byte cache line over 64-bit external buses).
Instead, CPUs have large fast caches, including a shared L3 cache so data written by one core and read by another doesn't usually have to go through memory if it's still hot in L3. See some details on how cache can be that fast for Intel IvyBridge, which has only 128-bit load/store paths even though it supports 256-bit AVX instructions. Haswell widened the load/store paths to 256-bit as well. Skylake-AVX512 widened the registers and data paths to 512-bit for L1d cache, and the connection between L1d and L2.
But on paper, x86 (since P5 Pentium and later) only guarantees that aligned loads/stores up to 64 bits are atomic, so implementations with SSE are allowed to split 128-bit XMM loads/stores into two 64-bit halves. Pentium III and Pentium M actually did this. But note that i586 Pentium predated x86-64 by a decade, and the only way it could load/store 64 bits was with x87 fld or fild. Pentium MMX could do 64-bit MMX movq loads/stores. Anyway, this atomicity guarantee includes uncached stores (e.g. for MMIO), which was possible (cheaply, without a bus-lock) because the P5 microarchitecture has a 64-bit external bus, even though it's strictly 32-bit other than the FPU.
Even pure integer code benefits from the wide data paths because it increases bandwidth for integer code with loads/stores that hit in L3 or especially L2 cache, but not L1d cache.
All these SIMD extensions to x86 make it vastly more powerful than a purely 32-bit integer architecture. But when running in 32-bit mode, it's still the same mode introduced by 386, and we call it 32-bit mode. It's as good a name as any, but don't try to read too much into it.
In fact, don't read anything into it except for integer / pointer register widths. The hardware it runs on typically has 64-bit integer registers, and 48-bit virtual address space. And data buses + caches of various huge widths, and complex out-of-order machinery to give the illusion of running in-order while actually looking at a window of up to 224 uops to find instruction-level parallelism. (Skylake / Kaby Lake / Coffee Lake ROB size).