Your real bottleneck with MESH is that each RTCPeerConnection will do its own video encoding in the browser.
P2P inherently requires each sender to adapt its encoding quality to the receiver's network conditions. So when your browser sends video to peer X (good downlink) and peer Y (poor downlink), the two encodings diverge: Y receives a lower framerate and bitrate than X.
Sounds reasonable, right? Unfortunately, that behavior mandates a separate video encoding for each peer connection.
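To make that concrete, here is a minimal sketch of the mesh fan-out; `peerIds` and the signaling plumbing are placeholders, not part of any real API:

```ts
// Minimal mesh fan-out sketch. `peerIds` and the signaling steps are
// placeholders; the WebRTC calls themselves are the standard API.
async function startMesh(peerIds: string[]): Promise<Map<string, RTCPeerConnection>> {
  const stream = await navigator.mediaDevices.getUserMedia({ video: true });
  const [track] = stream.getVideoTracks();

  const peers = new Map<string, RTCPeerConnection>();
  for (const peerId of peerIds) {
    const pc = new RTCPeerConnection();
    // Each addTrack() creates its own RTCRtpSender, and the browser runs
    // a separate encoder behind each sender, adapting to that peer's
    // bandwidth estimate independently of all the others.
    pc.addTrack(track, stream);
    peers.set(peerId, pc);
    // ... createOffer()/setLocalDescription() and signaling omitted ...
  }
  return peers;
}
```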
If multiple peer connections could reuse the same video encoding, MESH would be much more viable, but Google never provided that option in the browser. Simulcast requires an SFU to be useful, so it doesn't help in your case.
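For contrast, this is roughly what simulcast looks like on the sender side; it only pays off when an SFU on the receiving end picks the right layer for each subscriber, which is exactly what a mesh lacks:

```ts
// Sender-side simulcast sketch: three encodings of one track. A plain
// remote peer can't select among the layers; an SFU has to do that.
function addSimulcastTrack(pc: RTCPeerConnection, track: MediaStreamTrack): void {
  pc.addTransceiver(track, {
    direction: 'sendonly',
    sendEncodings: [
      { rid: 'low',  scaleResolutionDownBy: 4.0, maxBitrate: 150_000 },
      { rid: 'mid',  scaleResolutionDownBy: 2.0, maxBitrate: 500_000 },
      { rid: 'high', maxBitrate: 1_500_000 },
    ],
  });
}
```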
So, how many concurrent video encodings can a browser perform on a typical machine? For 720p 30 fps video, roughly 5-6, no more. For 640x480 at 15 fps, maybe 20.
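Using those rough figures (they are estimates, not benchmarks), you can put an upper bound on mesh size, since each of the n participants has to encode for the other n - 1:

```ts
// Upper bound on mesh size from the encoder budget alone: every
// participant uploads n - 1 separately encoded streams, so n <= budget + 1.
function maxMeshParticipants(encoderBudget: number): number {
  return encoderBudget + 1;
}

maxMeshParticipants(6);  // ~7 people at 720p/30fps
maxMeshParticipants(20); // ~21 people at 640x480/15fps
```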
In my opinion, the encoding layer and the networking layer could have been separated in WebRTC's design; getUserMedia could even be extended into something like getEncodedUserMedia, so that you could send the same encoded content to multiple peers.
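Purely as a thought experiment, such a hypothetical getEncodedUserMedia might look like the sketch below; nothing in it exists in any browser:

```ts
// HYPOTHETICAL API SHAPE; none of this exists in any browser. The idea:
// encode once with fixed parameters, then fan the same encoded frames
// out to every peer, trading per-peer adaptation for a single encoder.
interface EncodedMediaConstraints {
  width: number;
  height: number;
  frameRate: number;
  maxBitrate: number; // one fixed target; no per-peer adaptation
}

declare function getEncodedUserMedia(
  constraints: EncodedMediaConstraints,
): Promise<MediaStreamTrack>;

async function fanOut(peers: Iterable<RTCPeerConnection>): Promise<void> {
  const track = await getEncodedUserMedia({
    width: 1280, height: 720, frameRate: 30, maxBitrate: 1_500_000,
  });
  for (const pc of peers) {
    pc.addTrack(track); // same encoded content goes to every peer
  }
}
```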
So that's the real practical reason people use an SFU for multi-party WebRTC.