2

I'm implementing sha512 in OpenCL technology. I have simple definition of kernel function

__kernel void _sha512(__global char *message, const uint length, __global char *hash);

On host I have implemented and successfully tested implementation of sha512 algorithm.

I have a problem with copy data from message array to temporary variable called character.

char character = message[i];

Where i is a loop variable - in range from 0 to message's size.

When I tried to run my program there I got this errors

0x00007FFD9FA03D54 (0x0000000010CD0F88 0x0000000010CD0F88 0x0000000010BAEE88 0x000000001A2942A0), nvvmCompilerProperty() + 0x26174 bytes(s)
...
0x00007FFDDFA70D51 (0x0000000000000000 0x0000000000000000 0x0000000000000000 0x0000000000000000), RtlUserThreadStart() + 0x21 bytes(s)
0x00007FFDDFA70D51 (0x0000000000000000 0x0000000000000000 0x0000000000000000 0x0000000000000000), RtlUserThreadStart() + 0x21 bytes(s)

I readed about async_work_group_copy() but I can't understand how to use it - in docs I can't found any example code.

I have tried with char character = (__private char) message[i]; but it's not working too.

I don't understand how to pass last parameter into async_work_group_copy() and how to use it to copy data from __global memory into __private memory.

ventaquil
  • 2,780
  • 3
  • 23
  • 48

1 Answers1

1

OpenCL by default does not allow single-byte access in kernels: memory access needs to be in multiples of 4 bytes, aligned to 4-byte boundaries. If your implementation supports it, you can enable byte-wise memory accesses. This involves the cl_khr_byte_addressable_store extension, which you need to check for and explicitly enable in your kernel source. Give that a try and see if it solves your problem.

To use async_work_group_copy, try something like this:

#define LOCAL_MESSAGE_SIZE 64 // or some other suitable size for your workgroup
__local char local_message[LOCAL_MESSAGE_SIZE];
event_t local_message_ready = async_work_group_copy(local_message, message, LOCAL_MESSAGE_SIZE, 0);
// ...

// Just before you need to use local_message's content:
wait_group_events(1, &local_message_ready);
// Use local_message from here onwards

Note that async_work_group_copy is not required; you can access global memory directly. Which will be faster depends on your kernel, OpenCL implementation, and hardware.

Another option (the only option if your implementation/hardware do not support cl_khr_byte_addressable_store) is to fetch your data in chunks of at least 4 bytes. Declare your message as a __global uint* and unpack the bytes by shifting and masking:

uint word = message[i];
char byte0 = (word & 0xff);
char byte1 = ((word >> 8) & 0xff);
char byte2 = ((word >> 16) & 0xff);
char byte3 = ((word >> 24) & 0xff);
// use byte0..byte3 in your algorithm

Depending on implementation, hardware, etc. you may find this to be faster than bytewise access. (You may want to check if you need to reverse the unpacking by reading the CL_DEVICE_ENDIAN_LITTLE property using clGetDeviceInfo if you're not sure if all your deployment platforms will be little-endian.)

pmdj
  • 22,018
  • 3
  • 52
  • 103