There is no reason to use assembly at all for this assignment; your solution does not require it, and C will work just as well. I assume you are running on top of an operating system, so that is going to get in the way of your measurements, both by limiting your control over where things actually execute and live in memory, and by making it hard to be sure you are measuring what you think you are measuring.
Cache basics, as far as taking these measurements is concerned... let's say there are four layers of memory: L1, the fastest, but also the most expensive and smallest; then L2, slower, not as expensive, likely larger than L1; then L3, less expensive, slower, and larger still; and then main memory, the slowest, cheapest, and largest.
Let's just say we have four chunks of memory we are going to work with: A, B, C, and D. L1 can only hold one chunk at a time, L2 two at a time, L3 three of the four, and main memory all four.
If we do a read, it first goes through L1; on a miss it goes to L2, on a miss there to L3, and on a miss there to main memory, where it will always be found. Understand, though, that this data is cached on the way back, so L3, L2, and L1 will all contain the data just read, evicting as necessary (not always true, but assume this simple cache model to understand how to complete your task). So if we read chunk A, then L1, L2, and L3 will all contain chunk A. Now, in this hypothetical model, if we read chunk B, then L1 will contain B, evicting A; L2 will contain A and B, and L3 will contain A and B. Read C, and L1 will contain C, evicting B; let's say that L2 chooses to evict A and contains B and C, and L3 contains A, B, and C. Read D, and L1 will contain D; let's say L2 evicts B and contains C and D, and L3 evicts A and contains B, C, and D.
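Purely to illustrate, here is a toy simulation of that walkthrough in C, assuming every level keeps its most-recently-used chunks (real replacement policies vary, as noted below):

    #include <stdio.h>
    #include <string.h>

    typedef struct { char slots[4]; int cap, n; } Cache;

    /* move chunk to the most-recently-used slot, evicting the
       least-recently-used entry if the cache is full */
    static void touch(Cache *c, char chunk) {
        for (int i = 0; i < c->n; i++)
            if (c->slots[i] == chunk) {        /* hit: move to front */
                memmove(&c->slots[1], &c->slots[0], i);
                c->slots[0] = chunk;
                return;
            }
        if (c->n < c->cap) c->n++;             /* miss: grow or evict */
        memmove(&c->slots[1], &c->slots[0], c->n - 1);
        c->slots[0] = chunk;
    }

    int main(void) {
        Cache L1 = { .cap = 1 }, L2 = { .cap = 2 }, L3 = { .cap = 3 };
        for (const char *p = "ABCD"; *p; p++) {
            touch(&L1, *p); touch(&L2, *p); touch(&L3, *p);
            printf("read %c: L1=%.*s L2=%.*s L3=%.*s\n", *p,
                   L1.n, L1.slots, L2.n, L2.slots, L3.n, L3.slots);
        }
        return 0;
    }

After reading A, B, C, D it prints L1=D, L2=DC, L3=DCB, matching the state described above.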
Assume that we don't know exactly how each cache chooses what to evict and what to keep, but assume that we do know, or can figure out from motherboard specs or other sources, how large each cache is. If the above example happened in that order, then L1 has D, L2 has C and D, L3 has B, C, and D, and main memory has all four: A, B, C, D. If, in that state, we read all of chunk A and time it, we are in theory reading it from main memory. It is not purely the time to read that memory: if any of the data being evicted has changed, it has to be written back upstream, possibly generating accesses all the way through the hierarchy. But ideally, if you are mostly doing reads, then you will be timing mostly the reads.
Let's say we found ourselves in the situation where chunk D is in L1, C and D in L2, and B, C, and D in L3, and we read all of chunk B and time it. Wouldn't that be measuring the time to access L3? With those same starting conditions, reading C would give us the L2 timing, and with those same starting conditions, reading D would give us the L1 timing, right?
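A minimal sketch of that experiment in C, assuming Linux's clock_gettime() and the made-up sizes of the toy model (one 32 KB chunk filling L1; for real measurements use your part's actual sizes, and repeat each probe and take the minimum, since the OS adds noise):

    #include <stdio.h>
    #include <stdint.h>
    #include <stddef.h>
    #include <time.h>

    #define CHUNK (32 * 1024)   /* assumed L1 size: one chunk fills L1 */

    static volatile unsigned char mem[4][CHUNK];   /* chunks A, B, C, D */

    static uint64_t now_ns(void) {
        struct timespec ts;
        clock_gettime(CLOCK_MONOTONIC, &ts);
        return (uint64_t)ts.tv_sec * 1000000000ull + ts.tv_nsec;
    }

    /* read every byte of one chunk and return the elapsed nanoseconds */
    static uint64_t time_read(int chunk) {
        uint64_t t0 = now_ns();
        for (size_t i = 0; i < CHUNK; i++)
            (void)mem[chunk][i];   /* volatile: reads are not optimized away */
        return now_ns() - t0;
    }

    int main(void) {
        const char *label[4] = { "A (main?)", "B (L3?)", "C (L2?)", "D (L1?)" };
        for (int target = 3; target >= 0; target--) {
            for (int c = 0; c < 4; c++)   /* re-prime: D in L1, C,D in L2, B,C,D in L3 */
                time_read(c);
            printf("%-10s %llu ns\n", label[target],
                   (unsigned long long)time_read(target));
        }
        return 0;
    }

Each probe re-reads A through D first so that every timing starts from the same starting conditions.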
The trick is to get yourself into those conditions. The sizes of the caches are likely not such that L2 is exactly twice the size of L1 and so on, so to completely control what is in L1 you need to read enough data to fill L1. More to the point, if you were to read an L3-sized amount of data, then in theory L3 holds all of that data, L2 holds the last L2-sized portion of it, and L1 holds the last L1-sized portion of it.
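For example, something along these lines does one front-to-back pass over an L3-sized buffer and then probes the regions that should have landed in each level; the 32K/256K/8M sizes are assumptions, plug in yours:

    #include <stdio.h>
    #include <stdint.h>
    #include <stddef.h>
    #include <time.h>

    #define L1_SIZE (32 * 1024)            /* assumed sizes: use your own */
    #define L2_SIZE (256 * 1024)
    #define L3_SIZE (8 * 1024 * 1024)

    static volatile unsigned char big[L3_SIZE];

    static uint64_t now_ns(void) {
        struct timespec ts;
        clock_gettime(CLOCK_MONOTONIC, &ts);
        return (uint64_t)ts.tv_sec * 1000000000ull + ts.tv_nsec;
    }

    static uint64_t time_range(size_t off, size_t len) {
        uint64_t t0 = now_ns();
        for (size_t i = 0; i < len; i++)
            (void)big[off + i];            /* volatile read */
        return now_ns() - t0;
    }

    int main(void) {
        struct { const char *name; size_t off, len; } probe[3] = {
            { "L1?", L3_SIZE - L1_SIZE, L1_SIZE },           /* read last     */
            { "L2?", L3_SIZE - L2_SIZE, L2_SIZE - L1_SIZE }, /* in L2, not L1 */
            { "L3?", 0,                 L2_SIZE },           /* in L3 only    */
        };
        for (int i = 0; i < 3; i++) {
            time_range(0, L3_SIZE);        /* re-prime: one full pass */
            uint64_t ns = time_range(probe[i].off, probe[i].len);
            printf("%s  %.3f ns/byte\n", probe[i].name,
                   (double)ns / probe[i].len);
        }
        return 0;
    }

Normalizing to nanoseconds per byte makes the three regions comparable even though they differ in length.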
Using the data cache is easier than the instruction cache, but you can do this either way; you need at least an L3-sized amount of instructions in main memory, a large quantity of NOPs. Executing a linear chunk of instructions is no different from reading a linear chunk of memory, as far as read cycles go. The instruction side is easier as far as enabling and using the I-cache goes; enabling the data cache may or may not be simple, depending on your operating system and how you are managing memory.
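A hedged sketch of the instruction-side version for x86-64 Linux: mmap() an executable buffer, fill it with NOP opcodes (0x90) ending in a RET (0xC3), and time calls into it. Some hardened kernels refuse PROT_EXEC on anonymous mappings, and on another architecture you would use its NOP encoding instead:

    #include <stdio.h>
    #include <stdint.h>
    #include <string.h>
    #include <time.h>
    #include <sys/mman.h>

    #define SLED (8 * 1024 * 1024)         /* assumed to be >= L3 size */

    static uint64_t now_ns(void) {
        struct timespec ts;
        clock_gettime(CLOCK_MONOTONIC, &ts);
        return (uint64_t)ts.tv_sec * 1000000000ull + ts.tv_nsec;
    }

    int main(void) {
        unsigned char *code = mmap(NULL, SLED + 1,
                                   PROT_READ | PROT_WRITE | PROT_EXEC,
                                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (code == MAP_FAILED) { perror("mmap"); return 1; }

        memset(code, 0x90, SLED);          /* x86 NOP */
        code[SLED] = 0xC3;                 /* x86 RET */
        __builtin___clear_cache((char *)code, (char *)code + SLED + 1);

        void (*run)(void) = (void (*)(void))(uintptr_t)code;

        uint64_t t0 = now_ns();
        run();                             /* cold: fetched from main memory */
        uint64_t cold = now_ns() - t0;

        t0 = now_ns();
        run();                             /* warm: whatever fits is now cached */
        uint64_t warm = now_ns() - t0;

        printf("cold %llu ns  warm %llu ns\n",
               (unsigned long long)cold, (unsigned long long)warm);
        return 0;
    }

Running through a linear NOP sled like this is the instruction-fetch equivalent of the linear data reads above.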