Just for anyone else coming across this: spending half an hour with the CUDA API in one hand and the PyCUDA documentation in the other does wonders. It's much simpler than my initial experiments indicated.
Runtime Kernel Info
Incoming lazy lazy code
...
kernel=mod.get_function("foo")
meminfo(kernel)
...
def meminfo(kernel):
    # Per-kernel resource usage, read from the compiled function object
    shared=kernel.shared_size_bytes
    regs=kernel.num_regs
    local=kernel.local_size_bytes
    const=kernel.const_size_bytes
    mbpt=kernel.max_threads_per_block
    print("=MEM=\nLocal:%d,\nShared:%d,\nRegisters:%d,\nConst:%d,\nMax Threads/B:%d" % (local,shared,regs,const,mbpt))
Example Output
=MEM=
Local:24,
Shared:64,
Registers:18,
Const:0,
Max Threads/B:512
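If you would rather have the numbers than a printout, here is a minimal sketch that collects the same attributes into a dict. The name kernel_stats is my own, not part of PyCUDA, and the stand-in object only exists so the snippet runs without a GPU; it carries the values from the example output above.

```python
from types import SimpleNamespace

# Attribute names are the ones read by meminfo() above.
KERNEL_ATTRS = ("local_size_bytes", "shared_size_bytes", "num_regs",
                "const_size_bytes", "max_threads_per_block")

def kernel_stats(kernel):
    """Collect the per-kernel resource attributes into a plain dict.

    kernel_stats is a hypothetical helper, not a PyCUDA function.
    """
    return {name: getattr(kernel, name) for name in KERNEL_ATTRS}

# Stand-in for a compiled kernel (assumption: values copied from the
# example output). In real use, pass mod.get_function("foo") instead.
fake_kernel = SimpleNamespace(local_size_bytes=24, shared_size_bytes=64,
                              num_regs=18, const_size_bytes=0,
                              max_threads_per_block=512)

stats = kernel_stats(fake_kernel)
print(stats["num_regs"])
```

Having a dict makes it easy to log the values or compare them between builds of the same kernel.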
Static Device Info
Incoming lazy lazy code
import pycuda.autoinit
import pycuda.driver as cuda
(free,total)=cuda.mem_get_info()
print("Global memory occupancy:%f%% free"%(free*100/total))
for devicenum in range(cuda.Device.count()):
    device=cuda.Device(devicenum)
    attrs=device.get_attributes()
    #Beyond this point is just pretty printing
    print("\n===Attributes for device %d"%devicenum)
    #items() works on both Python 2 and 3; iteritems() is Python 2 only
    for (key,value) in attrs.items():
        print("%s:%s"%(str(key),str(value)))
Example Output
Global memory occupancy:70.000000% free
===Attributes for device 0
MAX_THREADS_PER_BLOCK:512
MAX_BLOCK_DIM_X:512
MAX_BLOCK_DIM_Y:512
MAX_BLOCK_DIM_Z:64
MAX_GRID_DIM_X:65535
MAX_GRID_DIM_Y:65535
MAX_GRID_DIM_Z:1
MAX_SHARED_MEMORY_PER_BLOCK:16384
TOTAL_CONSTANT_MEMORY:65536
WARP_SIZE:32
MAX_PITCH:2147483647
MAX_REGISTERS_PER_BLOCK:8192
CLOCK_RATE:1500000
TEXTURE_ALIGNMENT:256
GPU_OVERLAP:1
MULTIPROCESSOR_COUNT:14
KERNEL_EXEC_TIMEOUT:1
INTEGRATED:0
CAN_MAP_HOST_MEMORY:1
COMPUTE_MODE:DEFAULT
MAXIMUM_TEXTURE1D_WIDTH:8192
MAXIMUM_TEXTURE2D_WIDTH:65536
MAXIMUM_TEXTURE2D_HEIGHT:32768
MAXIMUM_TEXTURE3D_WIDTH:2048
MAXIMUM_TEXTURE3D_HEIGHT:2048
MAXIMUM_TEXTURE3D_DEPTH:2048
MAXIMUM_TEXTURE2D_ARRAY_WIDTH:8192
MAXIMUM_TEXTURE2D_ARRAY_HEIGHT:8192
MAXIMUM_TEXTURE2D_ARRAY_NUMSLICES:512
SURFACE_ALIGNMENT:256
CONCURRENT_KERNELS:0
ECC_ENABLED:0
PCI_BUS_ID:1
PCI_DEVICE_ID:0
TCC_DRIVER:0
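The full dump is long, so here is a rough sketch of picking out only the attributes you care about. format_attrs is a hypothetical helper of mine, and it assumes that str(key) gives the attribute name exactly as it appears in the output above. The sample dict is a stand-in with values copied from that output; in real use you would pass the result of device.get_attributes().

```python
def format_attrs(attrs, wanted=None):
    """Return 'NAME:value' lines, sorted by name or limited to `wanted`.

    Hypothetical helper; assumes str(key) yields the attribute name.
    """
    rows = {str(key): value for key, value in attrs.items()}
    if wanted is None:
        names = sorted(rows)
    else:
        # Keep the caller's order and skip names the device did not report
        names = [name for name in wanted if name in rows]
    return ["%s:%s" % (name, rows[name]) for name in names]

# Stand-in for device.get_attributes() (assumption: values copied from
# the example output above).
sample = {"WARP_SIZE": 32, "MAX_THREADS_PER_BLOCK": 512,
          "MULTIPROCESSOR_COUNT": 14}

lines = format_attrs(sample, wanted=["MULTIPROCESSOR_COUNT", "WARP_SIZE"])
print("\n".join(lines))
```

Calling format_attrs(attrs) with no second argument gives the whole list in alphabetical order, which is easier to diff between machines than the unordered dump.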