Naive addition of two vectors

This implements a function that adds two arrays just to show how to import a CUDA module in Cython.


One should run:

nvcc --use-local-env --cl-version 2010 -lib -o cuda_blockadd.lib
nvcc --use-local-env --cl-version 2010 -lib -o cuda_threadadd.lib
nvcc --use-local-env --cl-version 2010 -lib -o cuda_btadd.lib
nvcc --use-local-env --cl-version 2010 -lib -o cuda_longadd.lib

And then:

python build_ext -i -I"C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v6.0\include" -lcudart -L"C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v6.0\lib\x64" --force clean


Some parameters may change according to your system!


The compiling steps are included in compile.bat so one can just run it.


Thre testing routine will sum two random 1-D arrays with n=15000000 terms and compare the results with numpy. Only the parallel='block-thread' should give the correct result:


Which will produce something like:

Sum using parallel block is correct: False
Sum using parallel thread is correct: False
Sum using parallel block-thread is correct: True
Sum using parallel long-array is correct: True


Timing the performance of this naive add function in IPython:

timeit np.add(a, b)
10 loops, best of 3: 26.3 ms per loop

timeit my_cuda.add(a, b, parallel='block-thread')
10 loops, best of 3: 108 ms per loop

timeit my_cuda.add(a, b, parallel='long-array')
10 loops, best of 3: 108 ms per loop

which is already pretty good...


The implementation using parallel='block' is limited to arrays up to n=65535 which is the maximum number of blocks, and for parallel='thread' to arrays up to n=512 which is the maximum number of threads per execution block.

The implementation using parallel='block-thread' is limited to n=65535*256=16776960 in the way it is implemented and it could be expanded, but the solution in parallel='long-array' is easier to manage and gives the same performance, being unlimited to the size of the array and limited only by the amount of RAM memory in the GPU.

Possible Improvements

Do one cache-friendly access in the 'long-array' implementation since the pace of blockDim.x * gridDim.x seems to require a large stride that will probably result in many cache losses (not sure though...).

Table Of Contents

Previous topic

Running CUDA in Python through Cython

Next topic


This Page