Assuming you don't have a way to cheat (like some way of getting that information from the operating system or some CPU identification register):
The basic idea is that, by design, your L1 cache is faster than your L2 cache, which is faster than your L3 cache, and so on. In any normal design, each level is also smaller than the next: L1 is smaller than L2, which is smaller than L3.
So you want to allocate a large-ish block of memory and then repeatedly access (read and write) a growing portion of it sequentially[1], timing a fixed number X of accesses at each size, until you notice that the time taken has risen sharply: at that point the portion no longer fits in the L1 cache. Then keep going until you see the same thing again for each further cache level. The block you allocate must be larger than the largest cache you are hoping to discover.
This requires access to some low-overhead timestamp counter for the actual measurement (as pointed out in the referred-to answer).
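
As a rough illustration, here is a minimal sketch of that measurement in C. It is one possible way to do it, not a definitive implementation, and it makes several assumptions of its own: a 64-byte cache line (`LINE`), the random-order variant from the footnote (a pointer chase through a shuffled cycle), reads only rather than reads and writes, and the standard `clock()` function as a portable stand-in for a real timestamp counter, which is tolerable only because a large batch of accesses is timed at once. The names `build_cycle`, `ns_per_access` and `sweep` are made up for this example.

```c
#include <stddef.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define LINE 64 /* assumed cache-line size in bytes; adjust for your CPU */

/* One node per assumed cache line, so each hop touches a new line. */
struct node {
    struct node *next;
    char pad[LINE - sizeof(struct node *)];
};

/* Small xorshift PRNG, used only to shuffle the visiting order. */
static unsigned long long rng_state = 88172645463325252ULL;
static size_t rng_next(void)
{
    rng_state ^= rng_state << 13;
    rng_state ^= rng_state >> 7;
    rng_state ^= rng_state << 17;
    return (size_t)rng_state;
}

/* Link n nodes into a single random cycle (Sattolo's algorithm), so that
   following ->next visits every node in an order a prefetcher is unlikely
   to predict. perm is caller-provided scratch space of n entries. */
void build_cycle(struct node *nodes, size_t *perm, size_t n)
{
    if (n == 0)
        return;
    for (size_t i = 0; i < n; i++)
        perm[i] = i;
    for (size_t i = n - 1; i > 0; i--) {
        size_t j = rng_next() % i; /* 0 <= j < i */
        size_t tmp = perm[i];
        perm[i] = perm[j];
        perm[j] = tmp;
    }
    for (size_t i = 0; i < n; i++)
        nodes[i].next = &nodes[perm[i]];
}

/* Sink so the compiler cannot discard the chase loop. */
struct node *volatile chase_sink;

/* Average nanoseconds per dependent load over a working set of `bytes`.
   Returns a negative value if the size is too small or allocation fails. */
double ns_per_access(size_t bytes, size_t accesses)
{
    size_t n = bytes / sizeof(struct node);
    if (n < 2 || accesses == 0)
        return -1.0;
    struct node *nodes = malloc(n * sizeof *nodes);
    size_t *perm = malloc(n * sizeof *perm);
    if (!nodes || !perm) {
        free(nodes);
        free(perm);
        return -1.0;
    }
    build_cycle(nodes, perm, n);
    free(perm);

    struct node *p = nodes;
    for (size_t i = 0; i < n; i++) /* warm-up: one full lap */
        p = p->next;

    clock_t t0 = clock();
    for (size_t i = 0; i < accesses; i++) /* timed batch of X accesses */
        p = p->next;
    clock_t t1 = clock();

    chase_sink = p;
    free(nodes);
    return (double)(t1 - t0) * 1e9 / CLOCKS_PER_SEC / (double)accesses;
}

/* Print one line per working-set size, doubling from 4 KiB to max_bytes. */
void sweep(size_t max_bytes)
{
    for (size_t bytes = 4096; bytes <= max_bytes; bytes *= 2)
        printf("%8zu KiB  %6.2f ns/access\n",
               bytes / 1024, ns_per_access(bytes, (size_t)1 << 24));
}
```

Calling something like `sweep((size_t)64 << 20)` from `main` prints one line per working-set size. The sizes at which the time per access jumps are candidates for the cache boundaries, although the steps are usually somewhat blurred by TLB misses, other processes and shared caches, so it is worth running it several times.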
[1] Or, if you want to try to fool any clever prefetching that may skew the results, randomly within a sequentially progressing N-byte block.