GPU Programming Guide

A comprehensive introduction to GPU programming, covering CUDA, OpenCL, and modern GPU computing concepts.

Introduction

GPU (Graphics Processing Unit) programming has evolved far beyond graphics rendering to become a cornerstone of modern high-performance computing. GPUs excel at parallel processing tasks, making them ideal for machine learning, scientific computing, data analysis, and more.

Why GPU Programming?

  • Massive Parallelism: Thousands of cores working simultaneously
  • High Memory Bandwidth: Optimized for data-intensive operations
  • Cost-Effective: Better performance per dollar than traditional CPUs for parallel workloads
  • Wide Adoption: Essential for AI/ML, scientific computing, and financial modeling

Key Concepts

Parallel Processing Model

GPUs use a Single Instruction, Multiple Threads (SIMT) execution model, a close relative of SIMD: groups of threads execute the same instruction on different data simultaneously.

Memory Hierarchy

  • Global Memory: Large, high-latency memory accessible by all threads
  • Shared Memory: Fast, on-chip memory shared within thread blocks
  • Local Memory: Per-thread private memory
  • Constant Memory: Read-only memory cached for fast access
  • Texture Memory: Optimized for spatial locality
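
To make the global/shared distinction concrete, here is a minimal CUDA sketch of a block-level sum reduction: each thread stages one element from global memory into shared memory, the block cooperates through the fast on-chip buffer, and a single thread writes the block's partial sum back to global memory. The kernel name, array sizes, and launch assumptions are illustrative, not taken from this guide.

__global__ void block_sum(const int *in, int *block_sums, int n) {
    __shared__ int tile[256];   // One slot per thread; assumes <= 256 threads per block (a power of two)
    int tid = threadIdx.x;
    int idx = blockIdx.x * blockDim.x + tid;

    // Stage one element from slow global memory into fast shared memory
    tile[tid] = (idx < n) ? in[idx] : 0;
    __syncthreads();            // Wait until every thread has written its slot

    // Tree reduction carried out entirely in shared memory
    for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
        if (tid < stride) {
            tile[tid] += tile[tid + stride];
        }
        __syncthreads();
    }

    // One thread per block writes the partial sum back to global memory
    if (tid == 0) {
        block_sums[blockIdx.x] = tile[0];
    }
}

Host code would then sum the per-block results, or launch the kernel again on block_sums.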

Thread Organization

  • Grid: Collection of thread blocks
  • Block: Group of threads that can cooperate and share memory
  • Thread: Individual execution unit
  • Warp: Group of 32 threads (CUDA) that execute in lockstep
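
The grid/block hierarchy is configured with dim3 values at launch time. The following sketch shows a 2D launch over a width x height image, with each thread deriving its own (x, y) coordinate; the kernel name, the per-pixel operation, and the 16x16 block size are illustrative assumptions.

__global__ void scale_image(float *img, int width, int height) {
    // Each thread derives its 2D coordinate from block and thread indices
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x < width && y < height) {
        img[y * width + x] *= 0.5f;   // Example per-pixel operation
    }
}

// Host side: 16x16 threads per block, enough blocks to cover the whole image
dim3 block(16, 16);
dim3 grid((width + block.x - 1) / block.x,
          (height + block.y - 1) / block.y);
scale_image<<<grid, block>>>(d_img, width, height);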

Programming Models

CUDA (Compute Unified Device Architecture)

NVIDIA’s proprietary parallel computing platform and programming model.

Key Features:

  • C/C++ extensions for GPU programming
  • Extensive tooling and documentation
  • Strong ecosystem and community support
  • Optimized for NVIDIA GPUs

Basic CUDA Program Structure:

#include <cstdlib>
#include <cuda_runtime.h>

__global__ void kernel_function(int *data, int n) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) {
        data[idx] = data[idx] * 2;  // Example operation: double each element
    }
}

int main() {
    const int n = 1 << 20;                 // Number of elements
    const size_t size = n * sizeof(int);

    // Host memory allocation and initialization
    int *h_data = (int *)malloc(size);
    for (int i = 0; i < n; i++) h_data[i] = i;

    // Device memory allocation
    int *d_data;
    cudaMalloc(&d_data, size);

    // Copy data to device
    cudaMemcpy(d_data, h_data, size, cudaMemcpyHostToDevice);

    // Launch kernel: enough blocks of 256 threads to cover all n elements
    int threads = 256;
    int blocks = (n + threads - 1) / threads;
    kernel_function<<<blocks, threads>>>(d_data, n);

    // Copy results back (this cudaMemcpy also synchronizes with the kernel)
    cudaMemcpy(h_data, d_data, size, cudaMemcpyDeviceToHost);

    // Free device and host memory
    cudaFree(d_data);
    free(h_data);
    return 0;
}
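
Assuming the file is saved as, say, example.cu (the name is illustrative), it can be compiled and run with:

nvcc example.cu -o example
./example

Production code should also check the return value of every CUDA API call; see the error-checking sketch under Best Practices below.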

OpenCL (Open Computing Language)

Cross-platform, open standard for parallel programming.

Key Features:

  • Vendor-agnostic (works on NVIDIA, AMD, Intel GPUs)
  • C-based programming language
  • Runtime compilation
  • More complex setup but greater flexibility
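
For comparison, the OpenCL C equivalent of the CUDA kernel shown above looks like the sketch below; in OpenCL the kernel source is typically passed to the runtime as a string and compiled at run time (the kernel name is illustrative):

__kernel void double_elements(__global int *data, const int n) {
    int idx = get_global_id(0);   // Global work-item index, analogous to the CUDA index computation
    if (idx < n) {
        data[idx] = data[idx] * 2;
    }
}

The host side (creating a context, command queue, program, and kernel objects) is considerably more verbose than in CUDA, which is the flexibility/complexity trade-off noted above.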

Modern Alternatives

HIP (Heterogeneous-Computing Interface for Portability)

  • AMD's open-source C++ runtime API and kernel language, designed as a CUDA portability layer
  • CUDA code can be translated to HIP (via the hipify tools) and run on AMD as well as NVIDIA GPUs
  • Useful for cross-platform development
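
Because HIP mirrors the CUDA API almost one-to-one, porting is often mechanical. A hedged sketch of what the memory-management lines from the CUDA example above look like in HIP (same variable names, which are assumed to be defined as before):

#include <hip/hip_runtime.h>

// hipMalloc / hipMemcpy / hipFree mirror cudaMalloc / cudaMemcpy / cudaFree
int *d_data;
hipMalloc(&d_data, size);
hipMemcpy(d_data, h_data, size, hipMemcpyHostToDevice);
kernel_function<<<blocks, threads>>>(d_data, n);   // __global__ kernels and <<< >>> launches also work under hipcc
hipMemcpy(h_data, d_data, size, hipMemcpyDeviceToHost);
hipFree(d_data);

The hipify-perl and hipify-clang tools automate most of this renaming.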

SYCL

  • C++ abstraction layer for heterogeneous computing
  • Single-source programming model
  • Open standard from the Khronos Group, built on standard C++

WebGPU

  • W3C web standard for GPU graphics and compute
  • Enables GPU programming in web browsers
  • JavaScript/TypeScript API, with shaders written in WGSL

Getting Started

Prerequisites

  • Basic C/C++ programming knowledge
  • Understanding of parallel programming concepts
  • Familiarity with computer architecture

Development Environment Setup

CUDA Development:

  1. Install NVIDIA drivers
  2. Install CUDA Toolkit
  3. Set up IDE (Visual Studio, Eclipse, or command line)
  4. Verify installation with nvcc --version
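
Beyond nvcc --version, a quick way to confirm that the toolkit and driver can actually see a GPU is to compile and run a small device-query program. A minimal sketch (the CUDA samples ship a more complete deviceQuery utility):

#include <cstdio>
#include <cuda_runtime.h>

int main() {
    int count = 0;
    cudaGetDeviceCount(&count);
    printf("CUDA devices found: %d\n", count);
    for (int i = 0; i < count; i++) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, i);
        printf("  Device %d: %s (compute capability %d.%d)\n",
               i, prop.name, prop.major, prop.minor);
    }
    return 0;
}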

OpenCL Development:

  1. Install vendor-specific SDK (NVIDIA, AMD, Intel)
  2. Install OpenCL headers and libraries
  3. Set up development environment

First Steps

  1. Start with Simple Examples
    • Vector addition (see the sketch after this list)
    • Matrix multiplication
    • Reduction operations
  2. Learn Memory Management
    • Host vs. device memory
    • Memory allocation and transfer
    • Memory coalescing
  3. Understand Thread Organization
    • Grid and block dimensions
    • Thread indexing
    • Synchronization
  4. Profile and Optimize
    • Use profiling tools (Nsight Systems, Nsight Compute, or the legacy nvprof)
    • Identify bottlenecks
    • Optimize memory access patterns
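
As referenced in step 1, the canonical vector-addition starting point looks like this (a minimal sketch; vector_add, d_a, d_b, and d_c are illustrative names, and the device arrays are assumed to be already allocated and filled):

__global__ void vector_add(const float *a, const float *b, float *c, int n) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;   // One thread per output element
    if (idx < n) {
        c[idx] = a[idx] + b[idx];
    }
}

// Launch with enough 256-thread blocks to cover n elements
int threads = 256;
int blocks = (n + threads - 1) / threads;
vector_add<<<blocks, threads>>>(d_a, d_b, d_c, n);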

Resources

Official Documentation

Books

  • “CUDA by Example” by Jason Sanders and Edward Kandrot
  • “Professional CUDA C Programming” by John Cheng, Max Grossman, Ty McKercher
  • “OpenCL Programming Guide” by Aaftab Munshi, Benedict Gaster, Timothy G. Mattson
  • “Programming Massively Parallel Processors” by David B. Kirk and Wen-mei W. Hwu

Online Courses

Tutorials and Examples

Tools and Libraries

  • Profiling: NVIDIA Nsight Systems and Nsight Compute (successors to the Visual Profiler), AMD Radeon GPU Profiler
  • Debugging: cuda-gdb, NVIDIA Nsight, compute-sanitizer
  • Libraries: cuBLAS, cuDNN, clBLAS/CLBlast (OpenCL BLAS implementations)
  • Frameworks: TensorFlow, PyTorch (GPU support)

Communities and Forums

Best Practices

  1. Memory Management
    • Minimize host-device transfers
    • Use pinned memory for frequent transfers
    • Align memory accesses
  2. Thread Organization
    • Choose appropriate block sizes (typically 256-1024 threads)
    • Ensure sufficient occupancy
    • Avoid thread divergence
  3. Optimization
    • Profile before optimizing
    • Focus on memory bandwidth first
    • Use shared memory effectively
    • Minimize register usage
  4. Debugging
    • Use proper error checking
    • Validate results on CPU first
    • Use debugging tools and assertions
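
As an example of the error checking recommended above, a common pattern is to wrap every runtime call in a macro; the sketch below assumes CUDA, and the macro name is arbitrary:

#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

#define CUDA_CHECK(call)                                                  \
    do {                                                                  \
        cudaError_t err_ = (call);                                        \
        if (err_ != cudaSuccess) {                                        \
            fprintf(stderr, "CUDA error %s at %s:%d\n",                   \
                    cudaGetErrorString(err_), __FILE__, __LINE__);        \
            exit(EXIT_FAILURE);                                           \
        }                                                                 \
    } while (0)

// Usage: wrap API calls, and check kernel launches via cudaGetLastError
CUDA_CHECK(cudaMalloc(&d_data, size));
kernel_function<<<blocks, threads>>>(d_data, n);
CUDA_CHECK(cudaGetLastError());          // Catches launch-configuration errors
CUDA_CHECK(cudaDeviceSynchronize());     // Catches errors raised during kernel execution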

Common Pitfalls

  1. Memory Leaks: Always free device memory
  2. Synchronization: Understand when synchronization is needed
  3. Thread Divergence: Avoid conditional branches that cause threads to diverge
  4. Memory Coalescing: Ensure memory accesses are coalesced for optimal performance
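
The last two pitfalls are easiest to see side by side in kernel code (an illustrative sketch, not from the guide; the problematic variants are left commented out):

__global__ void access_patterns(const float *in, float *out, int n, int stride) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx >= n) return;

    // Coalesced: consecutive threads read consecutive addresses (few wide transactions)
    out[idx] = in[idx];

    // Uncoalesced (avoid): consecutive threads read addresses `stride` apart,
    // forcing many separate memory transactions
    // out[idx] = in[(idx * stride) % n];

    // Divergent (avoid): threads in the same warp take different branches,
    // so the warp executes both paths one after the other
    // if (idx % 2 == 0) out[idx] *= 2.0f; else out[idx] *= 0.5f;
}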

Performance Considerations

  • Memory Bandwidth: Often the limiting factor
  • Occupancy: Balance between register usage and thread count
  • Instruction Throughput: Choose appropriate instruction mix
  • Memory Latency: Use memory hierarchy effectively
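
Because memory bandwidth is usually the first thing to check, a simple way to quantify it is to time a kernel with CUDA events and compute the effective bandwidth from the bytes it must move. A hedged sketch, reusing the doubling kernel from earlier (which reads and writes each element once) inside main:

cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start);
kernel_function<<<blocks, threads>>>(d_data, n);
cudaEventRecord(stop);
cudaEventSynchronize(stop);

float ms = 0.0f;
cudaEventElapsedTime(&ms, start, stop);

// Effective bandwidth: bytes read + bytes written, divided by elapsed time
double gb_per_s = (2.0 * n * sizeof(int)) / (ms / 1000.0) / 1e9;
printf("Effective bandwidth: %.1f GB/s\n", gb_per_s);

cudaEventDestroy(start);
cudaEventDestroy(stop);

Comparing this figure against the GPU's theoretical peak bandwidth shows how much headroom remains before instruction throughput becomes the limiting concern.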