An Easy Introduction to CUDA

Learn how to write GPU programs in CUDA from scratch, starting with why GPUs exist and ending with a kernel running at over 80% of peak memory bandwidth on real hardware.

Updated May 26, 2026

About this course

Most people who hear 'GPU programming' assume it's for machine learning researchers or graphics engineers. It's not. Any time you're running the same operation on a large array of data, a GPU can do that work orders of magnitude faster than a CPU. The gap isn't marginal. In this course, you go from a single-threaded kernel that takes 75 milliseconds to a prefetched multi-block kernel that finishes in under 50 microseconds on an NVIDIA T4. Same math, same hardware, completely different result.

The course builds in a straight line. First you learn why CPUs and GPUs are built differently and what that means for the kinds of problems each one handles well. Then you learn how CUDA threads and blocks work, how to give every thread its own slice of the array, and why block sizes should be multiples of 32. By the end of the second unit you'll have a working parallel kernel. Then the third unit shows you the part most tutorials skip: why that parallel kernel is still slow, how Unified Memory page faults are secretly copying data behind your back, and how one function call fixes it.

You don't need a background in computer architecture or prior GPU experience. You need to be comfortable writing C-style code and curious about why programs run at the speed they do. Every concept here is tied to a concrete number you can measure yourself, so you're not taking anyone's word for what's fast.

Details

Last updated May 26, 2026
3 Units, 6 lessons
4 Projects
3 Assessments

Skills you'll gain with this course

CPU vs. GPU Reasoning

Identify whether a given problem is a good fit for GPU acceleration based on how the work is structured.

CUDA Kernel Writing

Write and launch parallel CUDA kernels using the thread and block model, including the grid-stride loop pattern.

Memory Profiling

Read an nsys profiler output, spot Unified Memory page fault traffic, and diagnose when a kernel is memory-bound.

Prefetch Optimization

Use cudaMemPrefetchAsync to eliminate on-demand page migrations and get kernel runtimes close to peak hardware bandwidth.
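
To make the skills above concrete, here is a minimal sketch of the kind of program the course builds toward: a grid-stride kernel over Unified Memory, with a prefetch before launch. It is an illustration and not the course's own code. The kernel name add, the array size, and the block size of 256 are assumptions chosen for the example, and the four-argument cudaMemPrefetchAsync call assumes a CUDA 12.x or earlier toolkit (newer toolkits take a cudaMemLocation argument instead of a device ID).

```cuda
#include <cmath>
#include <cstdio>
#include <cuda_runtime.h>

// Grid-stride loop: each thread starts at its global index and advances by
// the total number of threads in the grid, so any grid size covers all n.
__global__ void add(int n, const float *x, float *y) {
    int index = blockIdx.x * blockDim.x + threadIdx.x;
    int stride = blockDim.x * gridDim.x;
    for (int i = index; i < n; i += stride) {
        y[i] = x[i] + y[i];
    }
}

int main() {
    const int n = 1 << 20;  // illustrative array size (assumption)
    const size_t bytes = n * sizeof(float);

    // Unified Memory: one pointer usable from both CPU and GPU.
    float *x = nullptr, *y = nullptr;
    cudaMallocManaged(&x, bytes);
    cudaMallocManaged(&y, bytes);

    // Initialize on the host, so the pages start out in CPU memory.
    for (int i = 0; i < n; ++i) {
        x[i] = 1.0f;
        y[i] = 2.0f;
    }

    // Move the pages to the GPU up front instead of letting the kernel
    // trigger on-demand page faults. Assumes the pre-CUDA 13 signature:
    // (pointer, byte count, destination device, stream).
    int device = 0;
    cudaGetDevice(&device);
    cudaMemPrefetchAsync(x, bytes, device, 0);
    cudaMemPrefetchAsync(y, bytes, device, 0);

    // Block size is a multiple of 32; round the block count up to cover n.
    const int blockSize = 256;
    const int numBlocks = (n + blockSize - 1) / blockSize;
    add<<<numBlocks, blockSize>>>(n, x, y);

    // Kernel launches are asynchronous; wait before reading y on the host.
    cudaError_t err = cudaDeviceSynchronize();
    if (err != cudaSuccess) {
        std::printf("CUDA error: %s\n", cudaGetErrorString(err));
        return 1;
    }

    // Every element should now be 1.0f + 2.0f.
    float maxError = 0.0f;
    for (int i = 0; i < n; ++i) {
        maxError = std::fmax(maxError, std::fabs(y[i] - 3.0f));
    }
    std::printf("Max error: %f\n", maxError);

    cudaFree(x);
    cudaFree(y);
    return 0;
}
```

Assuming the file is saved as add.cu (a hypothetical name), it can be compiled with nvcc add.cu -o add and profiled with nsys profile --stats=true ./add. Commenting out the two prefetch calls and profiling again is one way to see the Unified Memory migration traffic for yourself.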

Syllabus

3 Units • 6 Lessons • 4 Projects • 3 Assessments

Ways To Learn Included

Every lesson enables you to learn in a variety of ways.

Read, Flashcards, Quiz, Lecture, Podcast, Chat, Jam, Arcade, Video, Comic
