How to sum grouped values in CUDA
I would like to know which techniques I can apply to sum over one dimension of an array and save the result in a new vector of lower dimension, as in the following example:
a -> [1,2], [3,4], [5,6]
b -> [3], [7], [11]
figure: http://snag.gy/83qwl.jpg
If you want to write your own CUDA kernel, take a look at the vector add sample. Instead of passing 2 input vectors to the kernel, pass only a, and provide a loop that sums over the "rows" of a:
__global__ void mykernel(int *a, int *b, int rows, int cols){
  // one thread per row of a (row-major storage)
  int idx=threadIdx.x+blockDim.x*blockIdx.x;
  if (idx < rows) {
    int sum = 0;
    for (int i=0; i< cols; i++)
      sum += a[(idx*cols)+i];
    b[idx] = sum;
  }
}
This won't be terribly efficient, but you can improve the efficiency if you can store your a array in column-major order:
a -> [1,3,5], [2,4,6]
Then a modification of the above kernel becomes pretty efficient:
__global__ void mykernel(int *a, int *b, int rows, int cols){
  // one thread per row of a (column-major storage)
  int idx=threadIdx.x+blockDim.x*blockIdx.x;
  if (idx < rows) {
    int sum = 0;
    for (int i=0; i< cols; i++)
      sum += a[(i*rows)+idx];  // column i starts at offset i*rows
    b[idx] = sum;
  }
}
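The column-major version is faster because, on each loop iteration, neighbouring threads read neighbouring addresses, so the loads coalesce. To try it end to end, here is a minimal sketch of the host side. The names sumRowsColMajor and rowSumsColMajor, and the block size of 256, are my own choices for illustration and not part of any sample code. Error checking on the CUDA calls is left out for brevity.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Column-major layout: element (row, col) lives at a[col * rows + row],
// so threads idx and idx+1 read adjacent ints on every iteration.
__global__ void sumRowsColMajor(const int *a, int *b, int rows, int cols) {
    int idx = threadIdx.x + blockDim.x * blockIdx.x;
    if (idx < rows) {
        int sum = 0;
        for (int i = 0; i < cols; i++)
            sum += a[i * rows + idx];
        b[idx] = sum;
    }
}

// Hypothetical host helper (not from the original answer): copies a
// column-major matrix to the device, launches one thread per row,
// and returns the per-row sums.
std::vector<int> rowSumsColMajor(const std::vector<int> &a, int rows, int cols) {
    int *d_a = nullptr, *d_b = nullptr;
    cudaMalloc(&d_a, a.size() * sizeof(int));
    cudaMalloc(&d_b, rows * sizeof(int));
    cudaMemcpy(d_a, a.data(), a.size() * sizeof(int), cudaMemcpyHostToDevice);

    const int threads = 256;                            // assumed block size
    const int blocks = (rows + threads - 1) / threads;  // round up
    sumRowsColMajor<<<blocks, threads>>>(d_a, d_b, rows, cols);

    std::vector<int> b(rows);
    // This blocking copy also waits for the kernel to finish.
    cudaMemcpy(b.data(), d_b, rows * sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(d_a);
    cudaFree(d_b);
    return b;
}
```

Calling rowSumsColMajor({1, 3, 5, 2, 4, 6}, 3, 2) passes the column-major a from the example above with rows = 3 and cols = 2.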
If you're looking for efficiency and can't reorganize your data, a segmented parallel reduction will be fastest. You can try creating one based on the CUDA sample codes, but I would suggest using thrust, and specifically reduce_by_key.
You would leave your a array as it is and use it as the "values":
a -> [1,2], [3,4], [5,6]
and then create a "key" array that corresponds to the rows of your array:
k -> [0,0], [1,1], [2,2]
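With those two arrays in place, the reduce_by_key call could look roughly like the sketch below. The helper name rowSumsByKey is made up for illustration, and building the key array in a host loop is just the simplest way to match the k shown above, not necessarily the fastest.

```cuda
#include <thrust/copy.h>
#include <thrust/device_vector.h>
#include <thrust/reduce.h>
#include <vector>

// Hypothetical helper (not from the original answer): sums each row of a
// row-major rows x cols matrix by reducing runs of equal keys.
std::vector<int> rowSumsByKey(const std::vector<int> &a, int rows, int cols) {
    // Key array: element j of a belongs to row j / cols,
    // e.g. [0,0,1,1,2,2] for rows = 3, cols = 2.
    std::vector<int> k(a.size());
    for (size_t j = 0; j < k.size(); j++)
        k[j] = static_cast<int>(j) / cols;

    thrust::device_vector<int> d_vals(a.begin(), a.end());
    thrust::device_vector<int> d_keys(k.begin(), k.end());
    thrust::device_vector<int> d_keys_out(rows);  // one key per segment
    thrust::device_vector<int> d_sums(rows);      // one sum per segment

    // Consecutive equal keys form one segment; by default the values in
    // each segment are combined with operator+.
    thrust::reduce_by_key(d_keys.begin(), d_keys.end(), d_vals.begin(),
                          d_keys_out.begin(), d_sums.begin());

    std::vector<int> b(rows);
    thrust::copy(d_sums.begin(), d_sums.end(), b.begin());
    return b;
}
```

For the a above, rowSumsByKey({1, 2, 3, 4, 5, 6}, 3, 2) reduces the three key runs to one sum per row. If the extra memory for the key array is a concern, thrust's counting and transform iterators can generate the keys on the fly instead of storing them.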