Speed up your AI code
Christian Hudon
November 17 7 min

On November 26, 2020, I gave a presentation on how to speed up AI code, in which I showed how to profile and accelerate a PyTorch model, using Element AI N-Beats as an example. The presentation is perfect for both AI developers who productize models and practitioners who create models and want their training to go faster. It is also meant as a window into how you could dive in head first (with me!) and work on AI models at Element AI. Interested in joining our awesome dev team at Element AI? See our job openings. Below, you will find the presentation, the slides and a checklist to help you speed up your AI code.

The Presentation

The Slides

Download the slides.

The Checklist

Here is a list of the actions to take before and during profiling, as discussed in the talk.


  • Use a sampling profiler or two! (NVIDIA Nsight Systems for the GPU and system overview, PyInstrument / Py-Spy / Scalene for the Python code.)
  • Simplify the problem! Disable hyperparameter search and other parts that you know scale without problems.
  • To be able to iterate quickly, set up a training configuration with one epoch & a subset of your training data. Aim for a model that completes training in 5 minutes. (That’s also the maximum run time supported by NVIDIA’s Nsight Systems profiler.)
  • Make sure you’re always getting the same generation of GPU & CPU — especially if you submit your training jobs to a pool of machines — so the timing numbers are comparable from one run to the other.
  • Have a look at the test error. Not to assess the quality of your model — it won’t be good on only one epoch! But to help detect if the changes you make to speed up your code affect the results — they shouldn’t.
  • Add NVTX annotations for the important parts of your training pipeline, so you can see (in Nsight Systems) where your training time is being spent.
  • Create, at least in your mind, a picture of the whole system, with a rough idea of the speed of its different parts. Don’t skip this part! We’ll use this picture as soon as we get the first profile.
  • Focus your effort where it will have an impact. Ask yourself: what if, after infinite effort, the part of the system you want to speed up ended up running in zero time? How much would that change the total runtime of your program?
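
The "what if it ran in zero time?" question from the last item can be answered with a few lines of arithmetic. The sketch below is only an illustration: the part names and timings are made up, the helper `best_case_speedup` is hypothetical, and it assumes the parts run one after the other with no overlap.

```python
# Hypothetical per-epoch timings in seconds. Replace them with the
# numbers you read off your own profile.
timings = {
    "load dataset": 40.0,
    "build mini-batches": 120.0,
    "GPU forward/backward": 30.0,
    "logging": 10.0,
}


def best_case_speedup(timings, part):
    """Overall speedup if `part` took zero time and nothing else changed.

    Assumes the parts run sequentially, so the total is their sum.
    """
    total = sum(timings.values())
    remaining = total - timings[part]
    if remaining == 0:
        return float("inf")  # the part was the whole program
    return total / remaining


# Print the parts from most to least expensive with their upper bound.
for part in sorted(timings, key=timings.get, reverse=True):
    print(f"{part:<22} at most {best_case_speedup(timings, part):.2f}x faster overall")
```

With these invented numbers, making mini-batch creation free would give at most a 2.5x overall speedup, while making the GPU part free would give only about 1.18x. That is the kind of comparison that tells you where to spend your effort first.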

Profiling and Improving your Code

  • Look at your first code execution profile with the picture of the whole system you built earlier! In our case: is loading the dataset from storage a bottleneck? Is the creation of each mini-batch a bottleneck? There’s no point optimizing the GPU part of the training code if the CPU part or the dataset loading is the bottleneck! This is the most important point of this section: keep the GPU busy! Nsight Systems makes it very easy to see those bottlenecks.
  • If the bottleneck is Python code, it will show up in the Python profiler run.
    • If the code is called from a PyTorch DataLoader and you have CPUs to spare, can you simply set DataLoader’s num_workers parameter high enough so that it can keep up with the GPU? That is the easiest solution, if it is available to you.
    • Can you rewrite that code in a vectorized way?
    • Maybe it’s only useful for special cases? Bypass it for the common case then.
  • If the bottleneck is data transfers, it will show up in the GPU memory timelines of Nsight Systems.
    • Can you transfer your data in fewer, bigger chunks? That can be many orders of magnitude more efficient.
    • Can you overlap data transfer with processing, using pinned memory on the CPU / DataLoader side, and non-blocking GPU memory transfers?
  • If the GPU still isn’t busy enough, it will show up in the GPU timelines of Nsight Systems too.
    • Can you increase the size of your mini-batches without negatively impacting your test error? This needs to be confirmed with full experiment runs, but if you give your GPU more data to crunch on at the same time, it will have an easier time keeping busy.
    • Can you use CUDA Streams to have multiple independent computations running at the same time on the GPU? (NVIDIA presentation slides on CUDA Streams.)
  • If you’re still not satisfied...
    • Convert your PyTorch model to PyTorch-Lightning, then change one line of code to get: multi-GPU & distributed training, and float16 training too!
    • Get Python out of the loop.
      • Try out TorchScript (especially for inference).
      • Try Numba to compile your hard-to-vectorize Python code (with the @numba.jit and @numba.cuda.jit decorators).
    • Do more of the preprocessing on the GPU. Consider:
      • Rapids.AI: a subset of Pandas, Scikit-Learn and multiple other scientific libraries that runs on CUDA.
      • DALI: NVIDIA’s data loading library
      • CUVI: the CUDA Vision and Imaging library
      • CuPy, if you have NumPy code (mostly a drop-in replacement)
      • … and many more. Browse through the CUDA-X list. There are also many more good, open-source libraries; the Internet is (almost) your oyster!
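
As an illustration of the DataLoader and data-transfer items above, here is a minimal sketch of a training loop that uses num_workers, pinned memory and non-blocking transfers. It is not the code from the talk: the synthetic dataset, the tiny linear model and the helper names `make_loader` and `train_one_epoch` are placeholders, and it falls back to the CPU when no GPU is present.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset


def make_loader(dataset, batch_size=256, num_workers=0):
    """Build a DataLoader set up for fast host-to-GPU copies."""
    return DataLoader(
        dataset,
        batch_size=batch_size,
        shuffle=True,
        # Raise this until the profile shows the GPU no longer waits for data.
        num_workers=num_workers,
        # Page-locked host memory is what allows asynchronous copies to the GPU.
        pin_memory=torch.cuda.is_available(),
    )


def train_one_epoch(model, loader, optimizer, device):
    """Train for one epoch and return the number of samples processed."""
    model.train()
    seen = 0
    for inputs, targets in loader:
        # With pinned source memory, non_blocking=True lets the copy overlap
        # with GPU work already queued. On a CPU-only machine it has no effect.
        inputs = inputs.to(device, non_blocking=True)
        targets = targets.to(device, non_blocking=True)
        optimizer.zero_grad()
        loss = torch.nn.functional.mse_loss(model(inputs), targets)
        loss.backward()
        optimizer.step()
        seen += inputs.shape[0]
    return seen


device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
# Synthetic stand-in data: 1024 samples with 16 features and 1 target each.
dataset = TensorDataset(torch.randn(1024, 16), torch.randn(1024, 1))
model = torch.nn.Linear(16, 1).to(device)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

samples_seen = train_one_epoch(model, make_loader(dataset), optimizer, device)
print(f"processed {samples_seen} samples on {device}")
```

The sketch keeps num_workers at 0 so it runs anywhere. When you raise it in a script, put the top-level code under an `if __name__ == "__main__":` guard on platforms that start worker processes with spawn, and check the effect of each change in the profiler rather than assuming it helped.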

The Takeaways

  • A sampling profiler is a great tool and a force multiplier for this kind of work. Take the time to learn how to use one. It can change finding bottlenecks in your code from “Answering each question is an expedition” to “Asked and answered 10 profiling questions before lunch!”
  • Always start your work with a model of your system! Then use it to ask questions when you look at the output of profilers. This is an essential step, but one that is often not made explicit. The highest payoff will be when going from viewing your system as a vague blob of “code to make faster” to 3-5 top-level boxes of the main pieces and their relative speeds.
  • First, eliminate bottlenecks in your training pipeline. Only then should you focus on making the fastest part (the GPU for the “model training” case) as busy as possible.
  • It’s worth checking your code with a profiler even for research code. Basically, do it as soon as you’re starting to run many experiments. How much more productive would you be if your model trained 50% faster? What if it trained twice as fast? Ten times as fast? Maybe you have a simple bottleneck that you can fix quickly, that will give you this kind of a speedup. But you’ll never know until you check. The potential payoff is worth doing a quick check earlier rather than later.
  • Resist the temptation to guess at performance problems. Treat making your code faster as a scientific experiment. Confirm your hypothesis with the right tool (here, a sampling profiler) before you start to try to optimize code. Otherwise, you can invest time yet have minimal or no impact.
  • Rely as much as possible on the work of others who know much more than you. For example, vectorizing your code is not just a workaround for “Python is slow”. It also reuses the work of world-class experts in numerical computing, who took into account all the details of how the different versions of a CPU or GPU work in order to get the most speed out of, say, a matrix multiplication operation. Even if you’re writing C or C++, you should aim to use libraries with that level of effort invested in them whenever it makes sense.
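
To make the vectorization point concrete, here is a small sketch that standardizes each row of an array twice: once with a Python loop over the rows, and once with a single vectorized NumPy expression. The function names and the random data are invented for the example, and the timings it prints will vary from machine to machine.

```python
import timeit

import numpy as np


def standardize_loop(windows):
    """Standardize each row with an explicit Python loop over the rows."""
    out = np.empty_like(windows)
    for i in range(windows.shape[0]):
        row = windows[i]
        out[i] = (row - row.mean()) / (row.std() + 1e-8)
    return out


def standardize_vectorized(windows):
    """Same computation, expressed as whole-array operations."""
    mean = windows.mean(axis=1, keepdims=True)
    std = windows.std(axis=1, keepdims=True)
    return (windows - mean) / (std + 1e-8)


rng = np.random.default_rng(0)
windows = rng.normal(size=(2000, 64))  # made-up data: 2000 rows of 64 values

# Check that the two versions agree before comparing their speed.
assert np.allclose(standardize_loop(windows), standardize_vectorized(windows))

for fn in (standardize_loop, standardize_vectorized):
    seconds = timeit.timeit(lambda: fn(windows), number=5)
    print(f"{fn.__name__}: {seconds:.4f} s for 5 calls")
```

Checking that both versions return the same result first mirrors the earlier advice about watching the test error: a speedup only counts if the output is unchanged.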