How to compute grouped sums in CUDA
I would like to know which techniques I can apply to sum along a dimension of an array and save the result in a new, smaller vector, as in the following example:
a -> [1,2], [3,4], [5,6]
b -> [3], [7], [11]
figure: http://snag.gy/83qwl.jpg
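If you want to write your own CUDA kernel, take a look at the vector add sample. Instead of passing 2 input vectors to the kernel, pass only a, and provide a loop to sum over the "rows" of a:
__global__ void mykernel(int *a, int *b, int rows, int cols){
  // one thread per row
  int idx = threadIdx.x + blockDim.x * blockIdx.x;
  if (idx < rows) {
    int sum = 0;
    // row-major: element (idx, i) lives at idx*cols + i
    for (int i = 0; i < cols; i++)
      sum += a[(idx * cols) + i];
    b[idx] = sum;
  }
}
This won't be terribly efficient, because adjacent threads read addresses that are cols elements apart, so the loads are not coalesced. You can improve the efficiency if you can store your a array in column-major order:
a -> [1,3,5], [2,4,6]
Then a modification of the above kernel becomes pretty efficient:
__global__ void mykernel(int *a, int *b, int rows, int cols){
  // one thread per row
  int idx = threadIdx.x + blockDim.x * blockIdx.x;
  if (idx < rows) {
    int sum = 0;
    // column-major: element (idx, i) lives at i*rows + idx,
    // so adjacent threads read adjacent addresses
    for (int i = 0; i < cols; i++)
      sum += a[(i * rows) + idx];
    b[idx] = sum;
  }
}
If you're looking for efficiency but can't reorganize your data, then a segmented parallel reduction will be fastest. You can try creating something based on the CUDA sample codes, but I would suggest using Thrust, specifically reduce_by_key.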
You would leave your a array as is and use it as the "values":
a -> [1,2], [3,4], [5,6]
and create a "key" array that corresponds to the rows of the array:
k -> [0,0], [1,1], [2,2]
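As a minimal sketch of the reduce_by_key approach with the values and keys laid out as above, something like the following could work. The helper name row_sums and the choice to build the key array on the host are assumptions made for illustration, not part of the original answer; the Thrust calls themselves (device_vector, reduce_by_key, copy) are standard.

```cuda
#include <thrust/device_vector.h>
#include <thrust/reduce.h>
#include <thrust/copy.h>
#include <cstddef>
#include <vector>

// Hypothetical helper (name and signature are illustrative only).
// Sums each row of a row-major rows x cols matrix stored in `a`.
// Assumes a.size() == rows * cols and cols > 0.
std::vector<int> row_sums(const std::vector<int>& a, int rows, int cols) {
    // Key array: element j belongs to row j / cols,
    // e.g. [0,0,1,1,2,2] for a 3 x 2 matrix.
    std::vector<int> h_keys(a.size());
    for (std::size_t j = 0; j < a.size(); ++j)
        h_keys[j] = static_cast<int>(j) / cols;

    // Copy values and keys to the device.
    thrust::device_vector<int> d_vals(a.begin(), a.end());
    thrust::device_vector<int> d_keys(h_keys.begin(), h_keys.end());

    // One output key and one output sum per row.
    thrust::device_vector<int> d_out_keys(rows);
    thrust::device_vector<int> d_sums(rows);

    // Runs of consecutive equal keys are reduced to a single value;
    // the default reduction operator is addition.
    thrust::reduce_by_key(d_keys.begin(), d_keys.end(),
                          d_vals.begin(),
                          d_out_keys.begin(),
                          d_sums.begin());

    // Copy the per-row sums back to the host.
    std::vector<int> result(rows);
    thrust::copy(d_sums.begin(), d_sums.end(), result.begin());
    return result;
}
```

Calling row_sums({1,2,3,4,5,6}, 3, 2) on the example data should give the b values 3, 7, 11. Storing the key array explicitly costs as much memory as the data itself; Thrust's counting and transform iterators could probably generate the keys on the fly instead, at the price of slightly denser code.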