Here I keep track of some questions I wanted to answer while studying CUDA C.

  1. If the CUDA runtime manages how work is distributed across threads and blocks, what are the implications of choosing different block and thread sizes?

  2. If shared memory latency is \approx 100 \times lower than uncached global memory latency, how can access to the array be made more cache friendly?
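For question 1, a minimal sketch (my own illustration, not from the notes above): the same vector-addition kernel launched with several block sizes. The grid size is derived from N, so every element is covered regardless of the block size chosen; the choice affects occupancy and scheduling, not correctness. All names here are illustrative.

```cuda
#include <cstdio>

__global__ void add(const float *a, const float *b, float *c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)                 // guard: the grid may overshoot n
        c[i] = a[i] + b[i];
}

int main(void) {
    const int N = 1 << 20;
    float *a, *b, *c;
    cudaMallocManaged(&a, N * sizeof(float));
    cudaMallocManaged(&b, N * sizeof(float));
    cudaMallocManaged(&c, N * sizeof(float));
    for (int i = 0; i < N; ++i) { a[i] = 1.0f; b[i] = 2.0f; }

    // Same kernel, different block sizes: the result is identical,
    // but occupancy and warp scheduling change with each launch shape.
    int sizes[] = {32, 128, 256, 1024};
    for (int s = 0; s < 4; ++s) {
        int threads = sizes[s];
        int blocks  = (N + threads - 1) / threads;  // ceil(N / threads)
        add<<<blocks, threads>>>(a, b, c, N);
        cudaDeviceSynchronize();
        printf("threads=%4d blocks=%6d c[0]=%.1f\n", threads, blocks, c[0]);
    }

    cudaFree(a); cudaFree(b); cudaFree(c);
    return 0;
}
```

Block sizes that are multiples of the warp size (32) avoid partially filled warps; very small blocks limit occupancy, while 1024 is the per-block thread maximum on current hardware.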


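For question 2, one common answer is shared-memory tiling, sketched below (an illustrative example, assuming a neighbor-sum kernel I made up for this note): each block stages a tile of the array into shared memory with one coalesced global read per element, then serves all repeated reads from the much faster shared memory.

```cuda
#include <cstdio>

#define TILE 256

__global__ void sum_neighbors(const float *in, float *out, int n) {
    __shared__ float tile[TILE + 1];      // +1 slot for the right halo
    int i = blockIdx.x * blockDim.x + threadIdx.x;

    // Coalesced: consecutive threads read consecutive global addresses.
    if (i < n)
        tile[threadIdx.x] = in[i];
    // The last thread of the block also fetches the halo element.
    if (threadIdx.x == blockDim.x - 1 && i + 1 < n)
        tile[threadIdx.x + 1] = in[i + 1];
    __syncthreads();

    // The neighbor read now hits shared memory, not global memory.
    if (i < n - 1)
        out[i] = tile[threadIdx.x] + tile[threadIdx.x + 1];
}

int main(void) {
    const int N = 1 << 10;
    float *in, *out;
    cudaMallocManaged(&in, N * sizeof(float));
    cudaMallocManaged(&out, N * sizeof(float));
    for (int i = 0; i < N; ++i) in[i] = (float)i;

    sum_neighbors<<<(N + TILE - 1) / TILE, TILE>>>(in, out, N);
    cudaDeviceSynchronize();
    printf("out[0]=%.1f out[%d]=%.1f\n", out[0], N - 2, out[N - 2]);

    cudaFree(in); cudaFree(out);
    return 0;
}
```

Without the tile, every thread would issue two global reads (its own element and its neighbor's); with it, each element crosses the global-memory bus exactly once per block.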