Lecture 28 : Optimizing Reduction Kernels

Reduction Kernel

Lecture 28 optimizing reduction kernels

Lecture 29 : Optimizing Reduction Kernels (Contd.)

Reduction Kernel

Lecture 30 : Optimizing Reduction Kernels (Contd.)

Complete unrolling, Multiple

Lecture 28: Liger Kernel - Efficient Triton Kernels for LLM Training

Byron Hsu presents LinkedIn's open-source collection of Triton

Lecture 33 : Optimizing Reduction Kernels (Contd.)

Sorting a bitonic sequence, All Prefix Sum, Inclusive and exclusive scan.

Lecture 31 : Optimizing Reduction Kernels (Contd.)

Sorting, Sorting Networks, Bitonic Sort Serial Implementation, Recursion.

Optimized Reduction Kernel Explained | CUDA Warp and Block Reduction

In this video, we explore the

Lecture 34 : Optimizing Reduction Kernels (Contd.)

Hillis-Steele inclusive scan, Prefix Sum Implementation, Blelloch Scan Algorithm and Implementation.

Lecture 32 : Optimizing Reduction Kernels (Contd.)

Comparator, Sorting subproblem, Bitonic Sort Parallel Implementation.

Optimizing Parallel Reduction in CUDA

https://developer.download.nvidia.com/assets/cuda/files/

Mod-01 Lec-28 Optimization

Foundations of

Lecture 16: Warp Scheduling and Divergence

Mapping thread blocks to GPU hardware, SMs, SPs, Batches, Scheduling.

Lecture 23: Memory Access Coalescing (Contd.)

Transpose Operation: Naive Row and Naive Col Implementations.

Lecture 24: Memory Access Coalescing (Contd.)

Profiling Analysis using NVPROF, load transactions, store transactions.

Lecture 17: Warp Scheduling and Divergence (Contd.)

Warp Scheduling, SIMD, Lanes.