This example implements a function that adds two arrays, just to show how to import a CUDA module in Cython.
One should run:
nvcc --use-local-env --cl-version 2010 -lib -o cuda_blockadd.lib cuda_blockadd.cu
nvcc --use-local-env --cl-version 2010 -lib -o cuda_threadadd.lib cuda_threadadd.cu
nvcc --use-local-env --cl-version 2010 -lib -o cuda_btadd.lib cuda_btadd.cu
nvcc --use-local-env --cl-version 2010 -lib -o cuda_longadd.lib cuda_longadd.cu
And then:
python setup.py build_ext -i -I"C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v6.0\include" -lcudart -L"C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v6.0\lib\x64" --force clean
Note: Some parameters may change according to your system!
Note: The compiling steps are included in compile.bat, so one can just run it.
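The .cu sources are not reproduced in this README, but a minimal sketch of what one of them (say, cuda_blockadd.cu) could look like follows: a kernel handling one element per block, plus an extern "C" entry point that the Cython wrapper can declare and call and that the nvcc -lib step above turns into a .lib for setup.py to link. The names and the use of double arrays here are assumptions, not the project's actual code.

    // Hypothetical sketch of cuda_blockadd.cu (not the actual source).
    #include <cuda_runtime.h>

    // One element per block: the kernel index is blockIdx.x only.
    __global__ void blockadd_kernel(const double *a, const double *b,
                                    double *c, int n)
    {
        int i = blockIdx.x;
        if (i < n)
            c[i] = a[i] + b[i];
    }

    // extern "C" keeps the symbol unmangled, so it can be declared in a
    // plain 'cdef extern' block on the Cython side and linked against
    // the cuda_blockadd.lib produced by nvcc.
    extern "C" void cuda_blockadd(const double *a, const double *b,
                                  double *c, int n)
    {
        double *d_a, *d_b, *d_c;
        size_t size = n * sizeof(double);

        cudaMalloc((void **)&d_a, size);
        cudaMalloc((void **)&d_b, size);
        cudaMalloc((void **)&d_c, size);

        cudaMemcpy(d_a, a, size, cudaMemcpyHostToDevice);
        cudaMemcpy(d_b, b, size, cudaMemcpyHostToDevice);

        blockadd_kernel<<<n, 1>>>(d_a, d_b, d_c, n);  // n blocks, 1 thread each

        cudaMemcpy(c, d_c, size, cudaMemcpyDeviceToHost);

        cudaFree(d_a);
        cudaFree(d_b);
        cudaFree(d_c);
    }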
The testing routine will sum two random 1-D arrays with n=15000000 terms and compare the results with numpy. Only parallel='block-thread' and parallel='long-array' should give the correct result:
python test.py
Which will produce something like:
Sum using parallel block is correct: False
Sum using parallel thread is correct: False
Sum using parallel block-thread is correct: True
Sum using parallel long-array is correct: True
Timing the performance of this naive add function in IPython:
timeit np.add(a, b)
10 loops, best of 3: 26.3 ms per loop
timeit my_cuda.add(a, b, parallel='block-thread')
10 loops, best of 3: 108 ms per loop
timeit my_cuda.add(a, b, parallel='long-array')
10 loops, best of 3: 108 ms per loop
which is already pretty good...
The implementation using parallel='block' is limited to arrays of up to n=65535 elements, which is the maximum number of blocks, and the one using parallel='thread' to arrays of up to n=512 elements, which is the maximum number of threads per execution block.
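As a rough illustration of where these limits come from (the kernel names below are made up, not the project's), the block-only version indexes with blockIdx.x alone and is launched with one thread per block, while the thread-only version indexes with threadIdx.x alone inside a single block:

    // parallel='block' style: launched as add_block<<<n, 1>>>(...), so n
    // is capped by the maximum grid size in x (65535 on older devices).
    __global__ void add_block(const double *a, const double *b,
                              double *c, int n)
    {
        int i = blockIdx.x;
        if (i < n)
            c[i] = a[i] + b[i];
    }

    // parallel='thread' style: launched as add_thread<<<1, n>>>(...), so n
    // is capped by the maximum threads per block (512 on older devices).
    __global__ void add_thread(const double *a, const double *b,
                               double *c, int n)
    {
        int i = threadIdx.x;
        if (i < n)
            c[i] = a[i] + b[i];
    }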
The implementation using parallel='block-thread' is limited to n=65535*256=16776960 in the way it is implemented; this could be extended, but the parallel='long-array' solution is easier to manage and gives the same performance, placing no limit on the array size other than the amount of RAM available on the GPU.
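A sketch of the combined block-thread indexing (again with assumed names): with 256 threads per block and at most 65535 blocks, the highest reachable index is 65535*256 = 16776960.

    // parallel='block-thread' style (hypothetical sketch).
    __global__ void add_block_thread(const double *a, const double *b,
                                     double *c, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)                  // the last block may be only partly used
            c[i] = a[i] + b[i];
    }

    // Host-side launch with a rounded-up block count, e.g.:
    //   int threads = 256;
    //   int blocks  = (n + threads - 1) / threads;   // must stay <= 65535
    //   add_block_thread<<<blocks, threads>>>(d_a, d_b, d_c, n);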
A possible improvement: make the memory access in the 'long-array' implementation more cache-friendly, since the stride of blockDim.x * gridDim.x seems large and will probably result in many cache misses (not sure, though...).
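For reference, the stride in question comes from a grid-stride loop of roughly the following shape (a sketch, not the project's actual cuda_longadd.cu; names and types are assumed):

    // parallel='long-array' style: each thread starts at its global index
    // and advances by the total number of threads in the grid, so a
    // fixed-size launch (e.g. <<<128, 256>>>) covers any n.
    __global__ void add_long_array(const double *a, const double *b,
                                   double *c, size_t n)
    {
        size_t i = (size_t)blockIdx.x * blockDim.x + threadIdx.x;
        size_t stride = (size_t)blockDim.x * gridDim.x;   // the stride mentioned above
        for (; i < n; i += stride)
            c[i] = a[i] + b[i];
    }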