How to sum grouped elements in CUDA

- July 15, 2010

I would like to know what techniques I can apply to sum along one dimension of an array and save the result in a new, smaller vector, as in the following example:

a -> [1,2], [3,4], [5,6]

b -> [3], [7], [11]

If you want to write your own CUDA kernel, take a look at the vector add sample. Instead of passing two input vectors to the kernel, pass only a, and provide a loop that sums over the "rows" of a:

    __global__ void mykernel(int *a, int *b, int rows, int cols){
      // one thread per row of a
      int idx = threadIdx.x + blockDim.x * blockIdx.x;
      if (idx < rows) {
        int sum = 0;
        for (int i = 0; i < cols; i++)
          sum += a[(idx*cols)+i];   // row-major: row idx starts at idx*cols
        b[idx] = sum;
      }
    }
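For anyone who has not worked through the vector add sample, here is a minimal sketch of the host side that goes with a kernel like this. The names row_sum_kernel and row_sums and the block size of 256 are my own assumptions, not something from the sample, and error checking is left out to keep it short.

```cuda
#include <vector>
#include <cuda_runtime.h>

// One thread per row; each thread walks its own row of the row-major matrix.
__global__ void row_sum_kernel(const int *a, int *b, int rows, int cols)
{
    int idx = threadIdx.x + blockDim.x * blockIdx.x;
    if (idx < rows) {
        int sum = 0;
        for (int i = 0; i < cols; i++)
            sum += a[idx * cols + i];
        b[idx] = sum;
    }
}

// Hypothetical host wrapper: copies a to the device, launches the kernel,
// and returns the per-row sums. CUDA error checking is omitted for brevity.
std::vector<int> row_sums(const std::vector<int> &a, int rows, int cols)
{
    int *d_a = nullptr, *d_b = nullptr;
    cudaMalloc(&d_a, a.size() * sizeof(int));
    cudaMalloc(&d_b, rows * sizeof(int));
    cudaMemcpy(d_a, a.data(), a.size() * sizeof(int), cudaMemcpyHostToDevice);

    const int threads = 256;                            // assumed block size
    const int blocks = (rows + threads - 1) / threads;  // round up to cover all rows
    row_sum_kernel<<<blocks, threads>>>(d_a, d_b, rows, cols);

    // This blocking copy also waits for the kernel to finish.
    std::vector<int> b(rows);
    cudaMemcpy(b.data(), d_b, rows * sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(d_a);
    cudaFree(d_b);
    return b;
}
```

On a machine with a working CUDA device, row_sums({1,2,3,4,5,6}, 3, 2) should give back 3, 7, 11, which is the b from the question.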

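This won't be terribly efficient, because adjacent threads read addresses that are cols elements apart, so the loads are not coalesced. You can improve efficiency if you can store the a array in column-major order: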

a -> [1,3,5], [2,4,6]

Then a modification of the above kernel becomes pretty efficient:

    __global__ void mykernel(int *a, int *b, int rows, int cols){
      int idx = threadIdx.x + blockDim.x * blockIdx.x;
      if (idx < rows) {
        int sum = 0;
        for (int i = 0; i < cols; i++)
          sum += a[(i*rows)+idx];   // column-major: column i starts at i*rows
        b[idx] = sum;
      }
    }
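To make the column-major indexing concrete, here is a small sketch under my own assumptions: to_column_major, row_sum_colmajor and row_sums_colmajor are hypothetical names, and the layout conversion is done on the host purely for illustration. With rows elements stored per column, element (row, col) lives at col * rows + row, so in each loop iteration neighboring threads read neighboring ints.

```cuda
#include <vector>
#include <cuda_runtime.h>

// Hypothetical helper: convert a row-major rows x cols matrix to column-major.
std::vector<int> to_column_major(const std::vector<int> &rm, int rows, int cols)
{
    std::vector<int> cm(rm.size());
    for (int r = 0; r < rows; r++)
        for (int c = 0; c < cols; c++)
            cm[c * rows + r] = rm[r * cols + c];
    return cm;
}

// One thread per row, reading a column-major matrix.
__global__ void row_sum_colmajor(const int *a, int *b, int rows, int cols)
{
    int idx = threadIdx.x + blockDim.x * blockIdx.x;
    if (idx < rows) {
        int sum = 0;
        for (int i = 0; i < cols; i++)
            sum += a[i * rows + idx];   // threads idx and idx+1 read adjacent ints
        b[idx] = sum;
    }
}

// Hypothetical host wrapper; takes the usual row-major input and converts it.
// CUDA error checking is omitted for brevity.
std::vector<int> row_sums_colmajor(const std::vector<int> &rm, int rows, int cols)
{
    std::vector<int> cm = to_column_major(rm, rows, cols);
    int *d_a = nullptr, *d_b = nullptr;
    cudaMalloc(&d_a, cm.size() * sizeof(int));
    cudaMalloc(&d_b, rows * sizeof(int));
    cudaMemcpy(d_a, cm.data(), cm.size() * sizeof(int), cudaMemcpyHostToDevice);

    const int threads = 256;                            // assumed block size
    const int blocks = (rows + threads - 1) / threads;  // round up
    row_sum_colmajor<<<blocks, threads>>>(d_a, d_b, rows, cols);

    std::vector<int> b(rows);
    cudaMemcpy(b.data(), d_b, rows * sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(d_a);
    cudaFree(d_b);
    return b;
}
```

In real code you would want the data produced in column-major order in the first place; converting on the host for every call would probably cost more than the coalescing saves.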

If you're looking for efficiency but can't reorganize your data, then a segmented parallel reduction will be fastest. You can try creating one based on the CUDA sample codes, but I would suggest using Thrust, specifically reduce_by_key.
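If you do want to try the hand-written route, one possible shape for it is one thread block per row, with the threads of the block cooperating through shared memory. This is only a rough sketch loosely in the style of a shared-memory reduction, not code taken from the CUDA samples; segmented_row_sum, segmented_row_sums and the block size of 128 are my own assumptions, and the block size has to be a power of two for the tree step to work.

```cuda
#include <vector>
#include <cuda_runtime.h>

// One block per row (segment); the threads of the block cooperate on that row.
__global__ void segmented_row_sum(const int *a, int *b, int cols)
{
    extern __shared__ int partial[];   // one int per thread, sized at launch
    const int row = blockIdx.x;
    const int tid = threadIdx.x;

    // Each thread sums a strided slice of the row; neighboring threads
    // read neighboring elements, so the row-major loads are coalesced.
    int sum = 0;
    for (int i = tid; i < cols; i += blockDim.x)
        sum += a[row * cols + i];
    partial[tid] = sum;
    __syncthreads();

    // Tree reduction in shared memory; assumes blockDim.x is a power of two.
    for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
        if (tid < stride)
            partial[tid] += partial[tid + stride];
        __syncthreads();
    }
    if (tid == 0)
        b[row] = partial[0];
}

// Hypothetical host wrapper; CUDA error checking is omitted for brevity.
std::vector<int> segmented_row_sums(const std::vector<int> &a, int rows, int cols)
{
    int *d_a = nullptr, *d_b = nullptr;
    cudaMalloc(&d_a, a.size() * sizeof(int));
    cudaMalloc(&d_b, rows * sizeof(int));
    cudaMemcpy(d_a, a.data(), a.size() * sizeof(int), cudaMemcpyHostToDevice);

    const int threads = 128;   // assumed block size, must be a power of two
    segmented_row_sum<<<rows, threads, threads * sizeof(int)>>>(d_a, d_b, cols);

    std::vector<int> b(rows);
    cudaMemcpy(b.data(), d_b, rows * sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(d_a);
    cudaFree(d_b);
    return b;
}
```

This only pays off when the rows are long. For rows of two elements, as in the example, most threads in each block have nothing to do, which is part of why a library routine is the easier recommendation.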

With reduce_by_key, you would leave your a array as is and use it as the "values":

a -> [1,2], [3,4], [5,6]

and create a "key" array that corresponds to the rows of your array:

k -> [0,0], [1,1], [2,2]
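Putting the values and keys together, a call to thrust::reduce_by_key could look roughly like the sketch below. The wrapper name row_sums_thrust is hypothetical, and building the key array on the host is just the simplest way to show the idea; it assumes every row has exactly cols elements stored contiguously.

```cuda
#include <cstddef>
#include <vector>
#include <thrust/device_vector.h>
#include <thrust/reduce.h>
#include <thrust/copy.h>

// Hypothetical wrapper: per-row sums of a row-major rows x cols matrix.
std::vector<int> row_sums_thrust(const std::vector<int> &a, int rows, int cols)
{
    // Key array: element j of a belongs to row j / cols,
    // giving 0,0,1,1,2,2 for the 3 x 2 example.
    std::vector<int> k(a.size());
    for (std::size_t j = 0; j < k.size(); j++)
        k[j] = static_cast<int>(j) / cols;

    thrust::device_vector<int> d_keys(k.begin(), k.end());
    thrust::device_vector<int> d_vals(a.begin(), a.end());
    thrust::device_vector<int> d_keys_out(rows);   // one output key per row
    thrust::device_vector<int> d_sums(rows);       // one sum per row

    // Reduces each run of consecutive equal keys; by default keys are
    // compared with equality and values are combined with addition.
    thrust::reduce_by_key(d_keys.begin(), d_keys.end(), d_vals.begin(),
                          d_keys_out.begin(), d_sums.begin());

    std::vector<int> b(rows);
    thrust::copy(d_sums.begin(), d_sums.end(), b.begin());
    return b;
}
```

For the example above, d_sums should end up holding 3, 7, 11. Storing the key array costs as much memory as the data itself; Thrust's counting and transform iterators can generate the keys on the fly instead, but that is left out here to keep the sketch short.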
