After further experimentation it is also possible to do this in C++. But it is not straightforward and I did not find a portable way of doing it. I can't post the code as it was done at work.
Essentially, from main, create new processes (where n is the number of cores available), capture the results from each shard, merge and output to the screen.
To get each process running a different shard, the total number of shard and instance number is given to the child process by the controller.
This is done by retrieving and copying the current environment, and setting in the copy the two environment variables (GTEST_TOTAL_SHARDS and GTEST_SHARD_INDEX) as required. GTEST_TOTAL_SHARDS is always the same, but GTEST_SHARD_INDEX will be the instance number of the child.
Merging the results is tedious but straightforward string manipulation. I successfully managed to get a correct total at the end, adding up the results of all the separate shards.
I was using Windows, so used CreateProcessA to create the new processes, passing in the custom environment.
It turned out that creating new processes takes a significant amount of time, but my program was taking about 3 minutes to run, so there was good benefits to be had from parallel running - the time came down to about 30 seconds on my 12 core PC.
Note that if this all seems overkill, there is a python program which does what I have described here but using a python script (I think - I haven't used it). This might be more straight forward.