GPU Programming Guide

A comprehensive introduction to GPU programming, covering CUDA, OpenCL, and modern GPU computing concepts.

Introduction

GPU (Graphics Processing Unit) programming has evolved far beyond graphics rendering to become a cornerstone of modern high-performance computing. GPUs excel at parallel processing tasks, making them ideal for machine learning, scientific computing, data analysis, and more.

Why GPU Programming?

  • Massive Parallelism: Thousands of cores working simultaneously
  • High Memory Bandwidth: Optimized for data-intensive operations
  • Cost-Effective: Better performance per dollar than traditional CPUs for parallel workloads
  • Wide Adoption: Essential for AI/ML, scientific computing, and financial modeling

Key Concepts

Parallel Processing Model

GPUs use a Single Instruction, Multiple Threads (SIMT) execution model, a close relative of SIMD: groups of threads execute the same instruction on different data simultaneously.

Memory Hierarchy

  • Global Memory: Large, high-latency memory accessible by all threads
  • Shared Memory: Fast, on-chip memory shared within thread blocks
  • Local Memory: Per-thread private memory
  • Constant Memory: Read-only memory cached for fast access
  • Texture Memory: Optimized for spatial locality
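
To make the global/shared distinction concrete, here is a minimal CUDA sketch of a block-level sum reduction: each thread stages one element from global memory into shared memory, the block cooperates through the fast on-chip buffer, and a single thread writes the block's partial sum back to global memory. The kernel name, array sizes, and launch assumptions are illustrative, not taken from this guide.

__global__ void block_sum(const int *in, int *block_sums, int n) {
    __shared__ int tile[256];   // One slot per thread; assumes <= 256 threads per block (a power of two)
    int tid = threadIdx.x;
    int idx = blockIdx.x * blockDim.x + tid;

    // Stage one element from slow global memory into fast shared memory
    tile[tid] = (idx < n) ? in[idx] : 0;
    __syncthreads();            // Wait until every thread has written its slot

    // Tree reduction carried out entirely in shared memory
    for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
        if (tid < stride) {
            tile[tid] += tile[tid + stride];
        }
        __syncthreads();
    }

    // One thread per block writes the partial sum back to global memory
    if (tid == 0) {
        block_sums[blockIdx.x] = tile[0];
    }
}

Host code would then sum the per-block results, or launch the kernel again on block_sums.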

Thread Organization

  • Grid: Collection of thread blocks
  • Block: Group of threads that can cooperate and share memory
  • Thread: Individual execution unit
  • Warp: Group of 32 threads (CUDA) that execute in lockstep
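
The grid/block hierarchy is configured with dim3 values at launch time. The following sketch shows a 2D launch over a width x height image, with each thread deriving its own (x, y) coordinate; the kernel name, the per-pixel operation, and the 16x16 block size are illustrative assumptions.

__global__ void scale_image(float *img, int width, int height) {
    // Each thread derives its 2D coordinate from block and thread indices
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x < width && y < height) {
        img[y * width + x] *= 0.5f;   // Example per-pixel operation
    }
}

// Host side: 16x16 threads per block, enough blocks to cover the whole image
dim3 block(16, 16);
dim3 grid((width + block.x - 1) / block.x,
          (height + block.y - 1) / block.y);
scale_image<<<grid, block>>>(d_img, width, height);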

Programming Models

CUDA (Compute Unified Device Architecture)

NVIDIA’s proprietary parallel computing platform and programming model.

Key Features:

  • C/C++ extensions for GPU programming
  • Extensive tooling and documentation
  • Strong ecosystem and community support
  • Optimized for NVIDIA GPUs

Basic CUDA Program Structure:

#include <cstdlib>
#include <cuda_runtime.h>

__global__ void kernel_function(int *data, int n) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) {
        data[idx] = data[idx] * 2;  // Example operation: double each element
    }
}

int main() {
    const int n = 1 << 20;                 // Number of elements
    const size_t size = n * sizeof(int);

    // Host memory allocation and initialization
    int *h_data = (int *)malloc(size);
    for (int i = 0; i < n; i++) h_data[i] = i;

    // Device memory allocation
    int *d_data;
    cudaMalloc(&d_data, size);

    // Copy data to device
    cudaMemcpy(d_data, h_data, size, cudaMemcpyHostToDevice);

    // Launch kernel: enough blocks of 256 threads to cover all n elements
    int threads = 256;
    int blocks = (n + threads - 1) / threads;
    kernel_function<<<blocks, threads>>>(d_data, n);

    // Copy results back (this cudaMemcpy also synchronizes with the kernel)
    cudaMemcpy(h_data, d_data, size, cudaMemcpyDeviceToHost);

    // Free device and host memory
    cudaFree(d_data);
    free(h_data);
    return 0;
}
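
Assuming the file is saved as, say, example.cu (the name is illustrative), it can be compiled and run with:

nvcc example.cu -o example
./example

Production code should also check the return value of every CUDA API call; see the error-checking sketch under Best Practices below.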

OpenCL (Open Computing Language)

Cross-platform, open standard for parallel programming.

Key Features:

  • Vendor-agnostic (works on NVIDIA, AMD, Intel GPUs)
  • C-based programming language
  • Runtime compilation
  • More complex setup but greater flexibility
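
For comparison, the OpenCL C equivalent of the CUDA kernel shown above looks like the sketch below; in OpenCL the kernel source is typically passed to the runtime as a string and compiled at run time (the kernel name is illustrative):

__kernel void double_elements(__global int *data, const int n) {
    int idx = get_global_id(0);   // Global work-item index, analogous to the CUDA index computation
    if (idx < n) {
        data[idx] = data[idx] * 2;
    }
}

The host side (creating a context, command queue, program, and kernel objects) is considerably more verbose than in CUDA, which is the flexibility/complexity trade-off noted above.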

Modern Alternatives

HIP (Heterogeneous-Computing Interface for Portability)

  • AMD's open-source C++ runtime API and kernel language, designed as a CUDA portability layer
  • CUDA code can be translated to HIP (via the hipify tools) and run on AMD as well as NVIDIA GPUs
  • Useful for cross-platform development
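
Because HIP mirrors the CUDA API almost one-to-one, porting is often mechanical. A hedged sketch of what the memory-management lines from the CUDA example above look like in HIP (same variable names, which are assumed to be defined as before):

#include <hip/hip_runtime.h>

// hipMalloc / hipMemcpy / hipFree mirror cudaMalloc / cudaMemcpy / cudaFree
int *d_data;
hipMalloc(&d_data, size);
hipMemcpy(d_data, h_data, size, hipMemcpyHostToDevice);
kernel_function<<<blocks, threads>>>(d_data, n);   // __global__ kernels and <<< >>> launches also work under hipcc
hipMemcpy(h_data, d_data, size, hipMemcpyDeviceToHost);
hipFree(d_data);

The hipify-perl and hipify-clang tools automate most of this renaming.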

SYCL

  • C++ abstraction layer for heterogeneous computing
  • Single-source programming model
  • Open standard from the Khronos Group, built on standard C++

WebGPU

  • W3C web standard for GPU graphics and compute
  • Enables GPU programming in web browsers
  • JavaScript/TypeScript API, with shaders written in WGSL

Getting Started

Prerequisites

  • Basic C/C++ programming knowledge
  • Understanding of parallel programming concepts
  • Familiarity with computer architecture

Development Environment Setup

CUDA Development:

  1. Install NVIDIA drivers
  2. Install CUDA Toolkit
  3. Set up IDE (Visual Studio, Eclipse, or command line)
  4. Verify installation with nvcc --version
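
Beyond nvcc --version, a quick way to confirm that the toolkit and driver can actually see a GPU is to compile and run a small device-query program. A minimal sketch (the CUDA samples ship a more complete deviceQuery utility):

#include <cstdio>
#include <cuda_runtime.h>

int main() {
    int count = 0;
    cudaGetDeviceCount(&count);
    printf("CUDA devices found: %d\n", count);
    for (int i = 0; i < count; i++) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, i);
        printf("  Device %d: %s (compute capability %d.%d)\n",
               i, prop.name, prop.major, prop.minor);
    }
    return 0;
}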

OpenCL Development:

  1. Install vendor-specific SDK (NVIDIA, AMD, Intel)
  2. Install OpenCL headers and libraries
  3. Set up development environment

First Steps

  1. Start with Simple Examples
    • Vector addition (see the sketch after this list)
    • Matrix multiplication
    • Reduction operations
  2. Learn Memory Management
    • Host vs. device memory
    • Memory allocation and transfer
    • Memory coalescing
  3. Understand Thread Organization
    • Grid and block dimensions
    • Thread indexing
    • Synchronization
  4. Profile and Optimize
    • Use profiling tools (Nsight Systems, Nsight Compute, or the legacy nvprof)
    • Identify bottlenecks
    • Optimize memory access patterns
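
As referenced in step 1, the canonical vector-addition starting point looks like this (a minimal sketch; vector_add, d_a, d_b, and d_c are illustrative names, and the device arrays are assumed to be already allocated and filled):

__global__ void vector_add(const float *a, const float *b, float *c, int n) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;   // One thread per output element
    if (idx < n) {
        c[idx] = a[idx] + b[idx];
    }
}

// Launch with enough 256-thread blocks to cover n elements
int threads = 256;
int blocks = (n + threads - 1) / threads;
vector_add<<<blocks, threads>>>(d_a, d_b, d_c, n);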

Resources

Official Documentation

Books

  • “CUDA by Example” by Jason Sanders and Edward Kandrot
  • “Professional CUDA C Programming” by John Cheng, Max Grossman, Ty McKercher
  • “OpenCL Programming Guide” by Aaftab Munshi, Benedict Gaster, Timothy G. Mattson
  • “Programming Massively Parallel Processors” by David B. Kirk and Wen-mei W. Hwu

Online Courses

Tutorials and Examples

Tools and Libraries

  • Profiling: NVIDIA Nsight Systems and Nsight Compute (successors to the Visual Profiler), AMD Radeon GPU Profiler
  • Debugging: cuda-gdb, NVIDIA Nsight, compute-sanitizer
  • Libraries: cuBLAS, cuDNN, clBLAS/CLBlast (OpenCL BLAS implementations)
  • Frameworks: TensorFlow, PyTorch (GPU support)

Communities and Forums

Best Practices

  1. Memory Management
    • Minimize host-device transfers
    • Use pinned memory for frequent transfers
    • Align memory accesses
  2. Thread Organization
    • Choose appropriate block sizes (typically 256-1024 threads)
    • Ensure sufficient occupancy
    • Avoid thread divergence
  3. Optimization
    • Profile before optimizing
    • Focus on memory bandwidth first
    • Use shared memory effectively
    • Minimize register usage
  4. Debugging
    • Use proper error checking
    • Validate results on CPU first
    • Use debugging tools and assertions
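
As an example of the error checking recommended above, a common pattern is to wrap every runtime call in a macro; the sketch below assumes CUDA, and the macro name is arbitrary:

#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

#define CUDA_CHECK(call)                                                  \
    do {                                                                  \
        cudaError_t err_ = (call);                                        \
        if (err_ != cudaSuccess) {                                        \
            fprintf(stderr, "CUDA error %s at %s:%d\n",                   \
                    cudaGetErrorString(err_), __FILE__, __LINE__);        \
            exit(EXIT_FAILURE);                                           \
        }                                                                 \
    } while (0)

// Usage: wrap API calls, and check kernel launches via cudaGetLastError
CUDA_CHECK(cudaMalloc(&d_data, size));
kernel_function<<<blocks, threads>>>(d_data, n);
CUDA_CHECK(cudaGetLastError());          // Catches launch-configuration errors
CUDA_CHECK(cudaDeviceSynchronize());     // Catches errors raised during kernel execution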

Common Pitfalls

  1. Memory Leaks: Always free device memory
  2. Synchronization: Understand when synchronization is needed
  3. Thread Divergence: Avoid conditional branches that cause threads to diverge
  4. Memory Coalescing: Ensure memory accesses are coalesced for optimal performance
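
The last two pitfalls are easiest to see side by side in kernel code (an illustrative sketch, not from the guide; the problematic variants are left commented out):

__global__ void access_patterns(const float *in, float *out, int n, int stride) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx >= n) return;

    // Coalesced: consecutive threads read consecutive addresses (few wide transactions)
    out[idx] = in[idx];

    // Uncoalesced (avoid): consecutive threads read addresses `stride` apart,
    // forcing many separate memory transactions
    // out[idx] = in[(idx * stride) % n];

    // Divergent (avoid): threads in the same warp take different branches,
    // so the warp executes both paths one after the other
    // if (idx % 2 == 0) out[idx] *= 2.0f; else out[idx] *= 0.5f;
}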

Performance Considerations

  • Memory Bandwidth: Often the limiting factor
  • Occupancy: Balance between register usage and thread count
  • Instruction Throughput: Choose appropriate instruction mix
  • Memory Latency: Use memory hierarchy effectively
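
Because memory bandwidth is usually the first thing to check, a simple way to quantify it is to time a kernel with CUDA events and compute the effective bandwidth from the bytes it must move. A hedged sketch, reusing the doubling kernel from earlier (which reads and writes each element once) inside main:

cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start);
kernel_function<<<blocks, threads>>>(d_data, n);
cudaEventRecord(stop);
cudaEventSynchronize(stop);

float ms = 0.0f;
cudaEventElapsedTime(&ms, start, stop);

// Effective bandwidth: bytes read + bytes written, divided by elapsed time
double gb_per_s = (2.0 * n * sizeof(int)) / (ms / 1000.0) / 1e9;
printf("Effective bandwidth: %.1f GB/s\n", gb_per_s);

cudaEventDestroy(start);
cudaEventDestroy(stop);

Comparing this figure against the GPU's theoretical peak bandwidth shows how much headroom remains before instruction throughput becomes the limiting concern.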