CUDA Programming Model II
Previously, we discussed the basic CUDA programming model: memory and threads. In this note, we dive into the hardware implementation. Understanding the hardware makes the philosophy behind CUDA programming clearer, which in turn helps us accelerate computation. The GPU architecture is built around a scalable array of multithreaded Streaming Multiprocessors (SMs). A GPU usually contains many SMs, and each SM can host hundreds of threads. When a CUDA program on the host CPU invokes a kernel grid, the blocks of the grid are enumerated and distributed to SMs with available execution capacity. ...
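To make the distribution of blocks to SMs concrete, here is a minimal host-side sketch in plain C++. It is a toy model, not the actual hardware scheduler: the function name `distribute_blocks`, the `Placement` struct, the round-robin order, and the assumption that every block runs for the same time (so blocks complete in "waves") are all illustrative assumptions. The real order in which blocks are assigned to SMs is not something a program can rely on.

```cpp
#include <cassert>
#include <vector>

// Hypothetical record: which SM a block lands on, and in which "wave".
struct Placement {
    int sm;    // index of the SM that hosts the block
    int wave;  // 0 for blocks resident immediately, 1 for the next batch, ...
};

// Toy model (assumption): each SM can hold at most `blocks_per_sm` resident
// blocks, all blocks take equal time, and free slots are filled round-robin.
// Blocks beyond the available capacity wait for a later wave.
std::vector<Placement> distribute_blocks(int num_blocks, int num_sms,
                                         int blocks_per_sm) {
    assert(num_blocks >= 0 && num_sms > 0 && blocks_per_sm > 0);

    std::vector<Placement> placements;
    placements.reserve(num_blocks);

    // Total number of blocks the whole GPU can host at the same time.
    const int slots_per_wave = num_sms * blocks_per_sm;

    for (int block = 0; block < num_blocks; ++block) {
        const int wave = block / slots_per_wave;          // which batch
        const int sm = (block % slots_per_wave) % num_sms;  // round-robin SM
        placements.push_back({sm, wave});
    }
    return placements;
}
```

For instance, under these assumptions `distribute_blocks(10, 2, 2)` models a grid of 10 blocks on a GPU with 2 SMs that each hold 2 blocks: blocks 0 to 3 are resident at once, blocks 4 to 7 wait for the second wave, and blocks 8 and 9 run in a third. The same grid on a GPU with more SMs would need fewer waves, which is the sense in which the architecture is scalable.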