There is a section on this in the book *Parallel Computer Architecture: A Hardware/Software Approach* (perhaps a bit outdated):
Shared Address Space without Coherent Replication
Systems in this category support a shared address space abstraction through the language and compiler but without automatic replication and coherence, just like the CRAY T3D and T3E did in hardware. One type of example is a data parallel language like High Performance Fortran (see Chapter 2). The distributions of data specified by the user, together with the owner computes rule, are used by the compiler or run-time system to translate off-node memory references to explicit messages, to make messages larger, to align data for better spatial locality, and so on. Replication and coherence are usually left up to the user, which compromises ease of programming; alternatively, system software may try to manage coherent replication in main memory automatically. Efforts similar to HPF are being made with languages based on C and C++ as well (Bodin et al. 1993; Larus, Richards, and Viswanathan 1996).
A more flexible language- and compiler-based approach is taken by the Split-C language (Culler et al. 1993). Here, the user explicitly specifies arrays as being local or global (shared) and for global arrays specifies how they should be laid out among physical memories. Computation may be assigned independently of the data layout, and references to global arrays are converted into messages by the compiler or run-time system based on the layout. The decoupling of computation assignment from data distribution makes the language much more flexible than an owner computes rule for load-balancing irregular programs, but it still does not provide automatic support for replication and coherence, which can be difficult for the programmer to manage. Of course, all these software systems can be easily ported to hardware-coherent shared address space machines, in which case the shared address space, replication, and coherence are implicitly provided. In this case, the run-time system may be used to manage replication and coherence in main memory and to transfer data in larger chunks than cache blocks, but these capabilities may not be necessary.
The languages based on C and C++ mentioned above are described in:

- "Parallel Programming in C**: A Large-Grain Data-Parallel Programming Language" (Larus, Richards, and Viswanathan)
- "Implementing a Parallel C++ Runtime System for Scalable Parallel Systems" (Bodin et al.)
- "Parallel Programming in Split-C" (Culler et al.)
These look like predecessors of CUDA. So the lack of coherence perhaps makes sense for massively parallel workloads, where the relatively slow synchronization (a consequence of the missing coherence) still accounts for only a tiny fraction of the overall runtime.
The CRAY T3D and T3E indeed provided a shared address space without hardware-supported coherence.