Naive addition of two vectors
=============================

This implements a function that adds two arrays, just to show how to
import a CUDA module in Cython.

Installing
----------

One should run::

    nvcc --use-local-env --cl-version 2010 -lib -o cuda_blockadd.lib cuda_blockadd.cu
    nvcc --use-local-env --cl-version 2010 -lib -o cuda_threadadd.lib cuda_threadadd.cu
    nvcc --use-local-env --cl-version 2010 -lib -o cuda_btadd.lib cuda_btadd.cu
    nvcc --use-local-env --cl-version 2010 -lib -o cuda_longadd.lib cuda_longadd.cu

And then::

    python setup.py build_ext -i -I"C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v6.0\include" -lcudart -L"C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v6.0\lib\x64" --force clean

.. note:: Some parameters may change according to your system!

.. note:: The compiling steps are included in ``compile.bat``, so one can
          just run it.

Testing
-------

The testing routine sums two random 1-D arrays with ``n=15000000`` terms
and compares the results with ``numpy``. Only ``parallel='block-thread'``
and ``parallel='long-array'`` should give the correct result::

    python test.py

which will produce something like::

    Sum using parallel block is correct: False
    Sum using parallel thread is correct: False
    Sum using parallel block-thread is correct: True
    Sum using parallel long-array is correct: True

Timing
------

Timing the performance of this naive add function in IPython::

    timeit np.add(a, b)
    10 loops, best of 3: 26.3 ms per loop

    timeit my_cuda.add(a, b, parallel='block-thread')
    10 loops, best of 3: 108 ms per loop

    timeit my_cuda.add(a, b, parallel='long-array')
    10 loops, best of 3: 108 ms per loop

which is already pretty good for a naive implementation, given that each
call presumably pays for the host-to-device and device-to-host copies.

Comments
--------

The implementation using ``parallel='block'`` is limited to arrays of up
to ``n=65535`` elements, which is the maximum number of blocks, and
``parallel='thread'`` to arrays of up to ``n=512`` elements, which is
the maximum number of threads per block. The implementation using
``parallel='block-thread'`` is limited to ``n=65535*256=16776960`` in
the way it is implemented. It could be extended, but the
``parallel='long-array'`` solution is easier to manage and gives the
same performance: it is not limited by the size of the array, only by
the amount of memory on the GPU.

Possible Improvements
---------------------

Make the memory access in the ``'long-array'`` implementation more
cache-friendly: the stride of ``blockDim.x * gridDim.x`` (see the
sketches below) seems large and will probably result in many cache
misses (not sure though...).
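For illustration, the ``'block-thread'`` scheme maps exactly one array
element to each thread, which is why it is capped at (number of blocks)
x (threads per block) elements. A minimal sketch of such a kernel,
assuming the usual pattern (the kernel name and signature are
illustrative, not necessarily what ``cuda_btadd.cu`` contains)::

    __global__ void kernel_bt_add(const double *a, const double *b,
                                  double *c, int n)
    {
        /* One element per thread: block offset plus thread offset. */
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)  /* guard the last, partially filled block */
            c[i] = a[i] + b[i];
    }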
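The ``'long-array'`` variant replaces the one-element-per-thread mapping
with a grid-stride loop, so a fixed-size grid can walk an array of any
length; this is where the ``blockDim.x * gridDim.x`` stride discussed
above comes from. Again a hedged sketch, not the literal contents of
``cuda_longadd.cu``::

    __global__ void kernel_long_add(const double *a, const double *b,
                                    double *c, int n)
    {
        /* Grid-stride loop: each thread handles several elements,
           advancing by the total number of threads in the grid. */
        for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
             i += blockDim.x * gridDim.x)
            c[i] = a[i] + b[i];
    }

Note that within a single pass of the loop, consecutive threads still
touch consecutive elements, so each pass should coalesce; the large
stride only separates successive passes of the same thread.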