How to sum grouped values (rows of an array) in CUDA


I know there are techniques to sum along a dimension of an array, saving the result into a new, smaller vector, as in the following example:

a -> [1,2], [3,4], [5,6]

b -> [3], [7], [11]

figure: http://snag.gy/83qwl.jpg

If you want to write your own CUDA kernel, take a look at the vectorAdd sample. Instead of passing two input vectors to the kernel, pass just a and provide a loop that sums over the "rows" of a:

__global__ void mykernel(int *a, int *b, int rows, int cols){
    int idx = threadIdx.x + blockDim.x * blockIdx.x;
    if (idx < rows) {
        int sum = 0;
        for (int i = 0; i < cols; i++)
            sum += a[(idx*cols)+i];   // row-major: walk across row idx
        b[idx] = sum;
    }
}
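
A minimal host-side driver for this kernel might look like the sketch below; the array contents match the example above, but the block size and variable names are assumptions for illustration, not part of the original answer.

#include <cstdio>
#include <cuda_runtime.h>

int main(){
    const int rows = 3, cols = 2;
    int h_a[rows*cols] = {1, 2, 3, 4, 5, 6};   // a -> [1,2], [3,4], [5,6] in row-major order
    int h_b[rows];

    int *d_a, *d_b;
    cudaMalloc(&d_a, rows*cols*sizeof(int));
    cudaMalloc(&d_b, rows*sizeof(int));
    cudaMemcpy(d_a, h_a, rows*cols*sizeof(int), cudaMemcpyHostToDevice);

    int threads = 256;                          // assumed block size
    int blocks  = (rows + threads - 1) / threads;
    mykernel<<<blocks, threads>>>(d_a, d_b, rows, cols);

    cudaMemcpy(h_b, d_b, rows*sizeof(int), cudaMemcpyDeviceToHost);
    for (int r = 0; r < rows; r++)
        printf("%d\n", h_b[r]);                 // expect 3, 7, 11

    cudaFree(d_a);
    cudaFree(d_b);
    return 0;
}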

This won't be terribly efficient, but you can improve the efficiency if you can store the a array in column-major order:

a -> [1,3,5], [2,4,6]  

Then the following modification of the above kernel becomes fairly efficient:

__global__ void mykernel(int *a, int *b, int rows, int cols){
    int idx = threadIdx.x + blockDim.x * blockIdx.x;
    if (idx < rows) {
        int sum = 0;
        for (int i = 0; i < cols; i++)
            sum += a[(i*rows)+idx];   // column-major: element (idx, i) lives at i*rows + idx
        b[idx] = sum;
    }
}
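
The reason this version is efficient is that adjacent threads (consecutive idx) read adjacent memory locations on each loop iteration, so the global loads coalesce. If your data starts out row-major, you would need to repack it first; a simple host-side sketch (the function name to_column_major is just an illustrative choice, not from the original answer):

// repack a row-major array into column-major order on the host
void to_column_major(const int *row_major, int *col_major, int rows, int cols){
    for (int r = 0; r < rows; r++)
        for (int c = 0; c < cols; c++)
            col_major[c*rows + r] = row_major[r*cols + c];
}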

If you're looking for efficiency but can't reorganize your data, a segmented parallel reduction will be the fastest. You could try building one based on the CUDA sample codes, but I would suggest using Thrust, in particular thrust::reduce_by_key.

You can leave the a array as it is and use it as the "values":

a -> [1,2], [3,4], [5,6]  

and create "key" array corresponds rows of array:

k -> [0,0], [1,1], [2,2] 
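
A sketch of the Thrust approach, assuming the flattened row-major values and the row-index keys shown above (the variable names are illustrative):

#include <thrust/device_vector.h>
#include <thrust/reduce.h>
#include <iostream>

int main(){
    // values: a flattened in row-major order -> [1,2], [3,4], [5,6]
    int h_vals[] = {1, 2, 3, 4, 5, 6};
    // keys: the row index of each element -> [0,0], [1,1], [2,2]
    int h_keys[] = {0, 0, 1, 1, 2, 2};

    thrust::device_vector<int> vals(h_vals, h_vals + 6);
    thrust::device_vector<int> keys(h_keys, h_keys + 6);

    thrust::device_vector<int> out_keys(3);
    thrust::device_vector<int> out_sums(3);

    // sums consecutive values that share the same key -> one sum per row
    thrust::reduce_by_key(keys.begin(), keys.end(),
                          vals.begin(),
                          out_keys.begin(),
                          out_sums.begin());

    for (int i = 0; i < 3; i++)
        std::cout << out_sums[i] << std::endl;   // expect 3, 7, 11
    return 0;
}

Because the keys for each row are contiguous, reduce_by_key produces exactly one sum per row, which matches the b vector from the example.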
