Naive addition of two vectors

This example implements a function that adds two arrays, mainly to show how to import a CUDA module in Cython.
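
For orientation, here is a minimal sketch of what such a module looks like on the CUDA side. The names (add_kernel, gpu_add) are made up for illustration and double-precision input is assumed; the real entry points live in the cuda_*add.cu files compiled below. The point is that the extern "C" wrapper hides all CUDA specifics, so Cython only needs a plain C declaration:

#include <cuda_runtime.h>

// Trivial element-wise addition kernel: one thread per element.
__global__ void add_kernel(const double *a, const double *b, double *c, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        c[i] = a[i] + b[i];
}

// Plain C entry point. On the Cython side this can be declared with
//     cdef extern void gpu_add(double *a, double *b, double *c, int n)
// and called directly; all device-memory handling stays on this side.
extern "C" void gpu_add(const double *a, const double *b, double *c, int n)
{
    double *d_a, *d_b, *d_c;
    size_t size = n * sizeof(double);

    cudaMalloc((void **)&d_a, size);
    cudaMalloc((void **)&d_b, size);
    cudaMalloc((void **)&d_c, size);

    cudaMemcpy(d_a, a, size, cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, b, size, cudaMemcpyHostToDevice);

    int threads = 256;
    int blocks = (n + threads - 1) / threads;
    add_kernel<<<blocks, threads>>>(d_a, d_b, d_c, n);

    cudaMemcpy(c, d_c, size, cudaMemcpyDeviceToHost);

    cudaFree(d_a);
    cudaFree(d_b);
    cudaFree(d_c);
}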

Installing

First, compile the CUDA sources into static libraries:

nvcc --use-local-env --cl-version 2010 -lib -o cuda_blockadd.lib cuda_blockadd.cu
nvcc --use-local-env --cl-version 2010 -lib -o cuda_threadadd.lib cuda_threadadd.cu
nvcc --use-local-env --cl-version 2010 -lib -o cuda_btadd.lib cuda_btadd.cu
nvcc --use-local-env --cl-version 2010 -lib -o cuda_longadd.lib cuda_longadd.cu

And then build the Cython extension:

python setup.py build_ext -i -I"C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v6.0\include" -lcudart -L"C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v6.0\lib\x64" --force clean

Note

Some parameters (e.g., the CUDA version, paths, and compiler flags) may need to be adjusted for your system!

Note

The compilation steps above are included in compile.bat, so one can simply run that instead.

Testing

The testing routine will sum two random 1-D arrays with n=15000000 terms and compare the results with numpy. Since n exceeds the limits of the 'block' and 'thread' implementations (see Comments below), only parallel='block-thread' and parallel='long-array' should give the correct result:

python test.py

which will produce something like:

Sum using parallel block is correct: False
Sum using parallel thread is correct: False
Sum using parallel block-thread is correct: True
Sum using parallel long-array is correct: True

Timing

Timing the performance of this naive add function in IPython:

timeit np.add(a, b)
10 loops, best of 3: 26.3 ms per loop

timeit my_cuda.add(a, b, parallel='block-thread')
10 loops, best of 3: 108 ms per loop

timeit my_cuda.add(a, b, parallel='long-array')
10 loops, best of 3: 108 ms per loop

which is already pretty good, considering that each my_cuda.add call presumably includes copying both input arrays to the GPU and the result back.

Comments

The implementation using parallel='block' is limited to arrays up to n=65535, which is the maximum number of blocks, and the one using parallel='thread' to arrays up to n=512, which is the maximum number of threads per execution block.
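
The actual kernels live in cuda_blockadd.cu and cuda_threadadd.cu; the sketch below (with made-up names) shows the indexing idea that produces those limits: each element is mapped either to one block or to one thread of a single block.

// 'block' variant: one block per element; the grid size caps n at 65535.
__global__ void add_block(const double *a, const double *b, double *c)
{
    int i = blockIdx.x;
    c[i] = a[i] + b[i];
}

// 'thread' variant: one thread of a single block per element;
// the block size caps n at 512 on this hardware generation.
__global__ void add_thread(const double *a, const double *b, double *c)
{
    int i = threadIdx.x;
    c[i] = a[i] + b[i];
}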

The implementation using parallel='block-thread' is limited to n=65535*256=16776960 as it is currently written. That limit could be raised, but the parallel='long-array' solution is easier to manage and gives the same performance: it is not limited by the size of the array, only by the amount of RAM on the GPU.
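
Again as a sketch with made-up names (the real kernels are in cuda_btadd.cu and cuda_longadd.cu): 'block-thread' builds a global index from the block and thread IDs, so n is capped by the grid size times the block size, while 'long-array' uses a grid-stride loop so any n fits.

// 'block-thread' variant: global index from block and thread IDs.
// With 256 threads per block and at most 65535 blocks,
// n is capped at 65535 * 256 = 16776960.
__global__ void add_block_thread(const double *a, const double *b,
                                 double *c, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        c[i] = a[i] + b[i];
}

// 'long-array' variant: grid-stride loop; each thread walks the array
// in steps of the total number of launched threads, so any n fits.
__global__ void add_long(const double *a, const double *b,
                         double *c, int n)
{
    for (int i = blockIdx.x * blockDim.x + threadIdx.x;
         i < n;
         i += blockDim.x * gridDim.x)
        c[i] = a[i] + b[i];
}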

Possible Improvements

Use a more cache-friendly access pattern in the 'long-array' implementation, since the stride of blockDim.x * gridDim.x between the iterations of each thread seems large and will probably result in many cache misses (not sure, though...).
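
One hypothetical way to test this would be to give each thread a contiguous chunk of the array instead, as in the sketch below. Note, though, that in the grid-stride version consecutive threads still access consecutive elements within each loop iteration, so those loads coalesce; the chunked variant may well turn out slower on a GPU.

// Per-thread contiguous chunks: thread t handles elements
// [t*chunk, (t+1)*chunk). Good locality per thread, but neighbouring
// threads in a warp no longer touch neighbouring addresses, so the
// accesses do not coalesce as well as in the grid-stride version.
__global__ void add_chunked(const double *a, const double *b,
                            double *c, int n)
{
    int nthreads = blockDim.x * gridDim.x;
    int t = blockIdx.x * blockDim.x + threadIdx.x;
    int chunk = (n + nthreads - 1) / nthreads;  // ceiling division
    int start = t * chunk;
    int end = min(start + chunk, n);

    for (int i = start; i < end; ++i)
        c[i] = a[i] + b[i];
}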
