How can I efficiently remove zero values from an array in parallel using CUDA? The number of zero values is known in advance, which should simplify the task.
It is important that the remaining values keep the same order as in the source array when they are copied to the resulting array.
Example:
Suppose the array contains the following values: [0, 0, 19, 7, 0, 3, 5, 0, 0, 1], with the additional information that 5 of them are zeros. The desired result is another array containing: [19, 7, 3, 5, 1]
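
This operation is commonly called stream compaction. As a rough sketch of one possible approach (not necessarily the fastest), Thrust's `thrust::copy_if` performs a stable copy, so the relative order of the kept elements is preserved. The helper name `compact_nonzero` and the predicate `is_nonzero` are hypothetical names chosen for illustration, and the sketch assumes `int` data and that the given zero count is exact; it is used only to size the output buffer.

```cuda
#include <cassert>
#include <cstddef>
#include <vector>

#include <thrust/copy.h>
#include <thrust/device_vector.h>

// Hypothetical predicate: keep every element that is not zero.
struct is_nonzero {
    __host__ __device__ bool operator()(int x) const { return x != 0; }
};

// Hypothetical helper: returns the non-zero elements of `input` in their
// original order. `zero_count` is the number of zeros known in advance and
// is assumed to be exact; it only determines the size of the output buffer.
std::vector<int> compact_nonzero(const std::vector<int>& input,
                                 std::size_t zero_count) {
    assert(zero_count <= input.size());

    // Copy the input to the device.
    thrust::device_vector<int> d_in(input.begin(), input.end());

    // Thanks to the known zero count, the output can be sized exactly.
    thrust::device_vector<int> d_out(input.size() - zero_count);

    // Stable parallel copy of all elements that satisfy the predicate.
    thrust::copy_if(d_in.begin(), d_in.end(), d_out.begin(), is_nonzero());

    // Copy the compacted result back to the host.
    std::vector<int> result(d_out.size());
    thrust::copy(d_out.begin(), d_out.end(), result.begin());
    return result;
}
```

Calling `compact_nonzero({0, 0, 19, 7, 0, 3, 5, 0, 0, 1}, 5)` on the example above should return the five non-zero values in their original order. If the zero count passed in is larger than the true count, the output buffer would be too small, so that assumption matters.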