The fastest path to building state-of-the-art AI
Element AI
April 15


By Boris Oreshkin and Dmitri Carpov

Dmitri Carpov (left) and Boris Oreshkin

We invented a groundbreaking new time-series forecasting model. Here’s how we harnessed the enormous computing power we needed to make it happen.

A little more than a year ago, a small group of us at Element AI, together with our collaborator Yoshua Bengio of the Quebec AI Institute (Mila), became absorbed by a challenging scientific question that took us months to untangle. Was it possible, we wanted to know, to build a time-series forecasting model with a purely deep neural architecture that could beat the accuracy of more established traditional approaches? Many had tried, but with little success: pure machine learning models were producing consistently underwhelming results.

We tried a different approach—and ultimately, to our great excitement, it worked. For the first time, a purely deep-learning model, which we called N-Beats, achieved state-of-the-art performance in univariate time-series forecasting. Today, we continue to explore the power of our model, which we believe may extend to areas outside of time-series forecasting.

N-Beats has a strong theoretical foundation, but we needed more than a whiteboard and some dry erase markers to build it. We needed to run thousands of experiments in parallel—and that required massive computing power. We benefited from access to an internal set of tools built to help our teams streamline the development of production-grade AI models and applications. In particular, Element AI Orkestrator, our GPU scheduling tool, enabled us to share parallel computing resources with colleagues from across the company and allowed us to run thousands of hyperparameter search and architecture optimisation jobs in parallel. These tools let us run multiple experiments at once and compare them, take advantage of moments when the cluster was underused, and secure the crucial GPU resources our team needed for N-Beats to emerge—all without monopolizing the cluster or preventing our colleagues from using the same resources.

Why a pure deep-learning approach to time-series forecasting?

Time-series forecasting analyzes past data to predict future values. It is central to just about every industry, and for good reason: sometimes even a tiny improvement in accuracy can translate into millions of dollars of operational savings.

Time-series datasets consist of a sequence of ordered data points, where the order is defined by the time when each data point was registered or observed. Models train on these past data points and then make their best prediction of what the values will be in the future. Imagine, for example, a dataset that contains the nationwide sales of VR headsets on a weekly basis over the past year. A time-series forecasting model might train on that dataset, taking into account both seasonal patterns, like holiday shopping, and the overall upward or downward trend. Using this past data, the model could then predict future sales of VR headsets at weekly intervals, enabling retailers to plan for and better meet demand.
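To make the windowing concrete, here is a minimal sketch, not taken from our actual pipeline, of how a weekly series like the one above could be sliced into training pairs of past windows and future targets. The synthetic data, function name and window lengths are purely illustrative.

```python
import numpy as np

# Illustrative weekly sales series (52 weeks); in practice this would be real data.
weekly_sales = np.random.default_rng(0).poisson(lam=200, size=52).astype(float)

def make_windows(series, lookback=12, horizon=4):
    """Slice a univariate series into (past window, future target) training pairs."""
    inputs, targets = [], []
    for end in range(lookback, len(series) - horizon + 1):
        inputs.append(series[end - lookback:end])   # what the model sees
        targets.append(series[end:end + horizon])   # what it must predict
    return np.stack(inputs), np.stack(targets)

X, y = make_windows(weekly_sales)
print(X.shape, y.shape)  # (37, 12) past windows and (37, 4) future targets
```

A model trained on pairs like these learns to map the last 12 observed weeks to the next 4, which is exactly the setting the VR-headset example describes.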

Up until the development of N-Beats, there was quite a bit of skepticism that a model with pure deep-learning architectures could compete with statistical approaches. So-called “hybrid” approaches that combine classical statistical methods with machine learning, however, were showing great potential. As some of the leading voices in the field put it in a 2018 paper, “hybrid approaches and combinations of method[s] are the way forward for improving the forecasting accuracy and making forecasting more valuable.”

If “hybrid” models could beat the best statistical models by a substantial margin, we reasoned, there might be an even greater opportunity to achieve state-of-the-art performance with a purely deep-learning model.

The logical place to start was the M4 dataset. At the time we carried out this research, it was the dataset from the most recent M-Competition, an international forecasting contest that dates back to 1982. The M4 Competition took place in 2018 and was won by a hybrid model developed by S. Smyl from Uber. Since the dataset from this competition is publicly available, we were able to use it to test our new deep-learning models.

We designed a rigorous experimental set-up, which would require a series of large-scale experiments on our GPU cluster. Here at Element AI, we have extensive computing resources, but these large-scale experiments would nonetheless seriously challenge our computing infrastructure.

The view from the pit

The orchestration of computing resources is essential for any problem that requires parallel calculations. In the case of N-Beats, we weren’t developing just a single model: in the M4 experiment, for example, N-Beats was a collection of 180 different models.

Building these models required running thousands of experiment trials in parallel, a process that is very GPU-hungry and can take huge amounts of time. Leveraging the power of our AI tools, we were able to test different hyperparameter settings, like the number of blocks, block sizes and window sizes, while also reducing the time it took experiments to complete by orders of magnitude. For each hyperparameter configuration, we needed to train 180 models to test their performance on the validation set, and the search required access to a vast number of our NVIDIA GPUs, all at the same time. In fact, it relied on nearly half of the Element AI cluster, which was under substantial demand from the hundreds of other AI training jobs that other researchers were launching for their respective projects.
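To illustrate how such a search can be enumerated, the sketch below builds a grid of configurations whose product happens to be 180, the size of the M4 ensemble mentioned above. The specific hyperparameter names and values are placeholders, not the exact grid we used.

```python
from itertools import product

# Placeholder grid: 6 window sizes x 3 training losses x 10 seeds = 180 configurations.
window_sizes = [2, 3, 4, 5, 6, 7]      # lookback length, in multiples of the forecast horizon
losses = ["smape", "mase", "mape"]     # candidate training losses
seeds = range(10)                      # independent random initialisations

configs = [
    {"window": window, "loss": loss, "seed": seed}
    for window, loss, seed in product(window_sizes, losses, seeds)
]
print(len(configs))  # 180 independent training jobs to schedule on the cluster
```

Each configuration becomes one training job, which is why the search needed so many GPUs at once.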

Monthly breakdown of cluster use for the development of N-Beats.

Today at Element AI, all AI practitioners run jobs on our GPU cluster using Element AI Orkestrator. It’s a tool that actively manages the allocation of GPU resources, and we developed it in-house almost three years ago, specifically to ensure that our teams can get access to the computing power they need to innovate and build transformative new AI products for industry.

It wasn’t always that way. Previously, Element AI practitioners used only about a quarter of their GPU resources at any given moment. Practitioners were still running big jobs, just a lot less efficiently, so R&D took more time. But today, GPU usage at Element AI averages about 90%, and our overall productivity has skyrocketed, with jobs running in parallel at all hours of the day and in highly efficient configurations. This allowed us to launch expansive experiments and get results within just a single day.

Our team found that we incurred our most significant computational expense when we were working to simplify our model. Here’s a technical summary of our approach: We started with a vanilla seq2seq modelling approach involving a stack of encoder/decoder blocks built from CNN/LSTM primitives. This solution turned out to be computationally hungry, taking 24 hours to train one model on the M4 dataset, and was not even close to the state-of-the-art. LSTMs have problems extrapolating trends and handling non-stationarity in time series (i.e., time series in which statistical properties, such as the mean, change with time). For this reason, heavy pre-processing is typically used in forecasting applications involving LSTMs. This is where the original idea of inserting a learnable trend removal pre-processing block emerged. The trend removal block, based on learnable polynomial approximation, subtracted its trend estimate from the CNN/LSTM stack input and provided a partial trend forecast to the output.
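Below is a minimal PyTorch sketch of such a learnable trend removal block, written from the description above rather than taken from our code. It assumes a fixed polynomial basis over normalised time; the class name, layer sizes and polynomial degree are illustrative.

```python
import torch
import torch.nn as nn

class TrendBlock(nn.Module):
    """Learnable trend removal: predict polynomial coefficients from the input window,
    subtract the fitted trend from the input and emit a partial trend forecast."""
    def __init__(self, lookback, horizon, degree=3, hidden=256):
        super().__init__()
        self.coeff_net = nn.Sequential(
            nn.Linear(lookback, hidden), nn.ReLU(),
            nn.Linear(hidden, degree + 1),   # polynomial coefficients
        )
        # Fixed polynomial bases over normalised time, for the input window and the forecast horizon.
        t_back = torch.linspace(0, 1, lookback)
        t_fore = torch.linspace(0, 1, horizon)
        self.register_buffer("basis_back", torch.stack([t_back ** p for p in range(degree + 1)]))
        self.register_buffer("basis_fore", torch.stack([t_fore ** p for p in range(degree + 1)]))

    def forward(self, x):                    # x: (batch, lookback)
        theta = self.coeff_net(x)            # (batch, degree + 1)
        backcast = theta @ self.basis_back   # trend estimate over the input window
        forecast = theta @ self.basis_fore   # partial trend forecast
        return x - backcast, forecast        # detrended input, forecast contribution
```

The detrended input is what would feed the CNN/LSTM stack, while the partial trend forecast is added to the stack's output.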

This new approach worked much better. So we started stacking learnable trend removal blocks and added learnable seasonality pre-processing blocks. It soon became apparent that the CNN/LSTM stack was doing nothing but consuming compute power during training. So we removed it and reduced training time by a factor of 10.
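The residual stacking idea can be sketched with the same interface as the trend block above; again, the names are illustrative. Each block removes what it can explain from its input and contributes a partial forecast, and a seasonality block would follow the same pattern with a Fourier basis in place of the polynomial one.

```python
import torch.nn as nn

class BlockStack(nn.Module):
    """Chain pre-processing blocks: each block receives the residual left by the
    previous block's backcast, and the partial forecasts are summed at the output."""
    def __init__(self, blocks):
        super().__init__()
        self.blocks = nn.ModuleList(blocks)

    def forward(self, x):                # x: (batch, lookback)
        forecast = 0.0
        for block in self.blocks:
            x, partial = block(x)        # residual input, partial forecast
            forecast = forecast + partial
        return forecast
```

With blocks that share this (residual, partial forecast) interface, dropping the CNN/LSTM stack amounts to simply removing those blocks from the list.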

Finally, we asked ourselves whether trend and seasonality modelling could be learned by a generic, fully connected network in a completely unconstrained way. We determined that a simple, fully connected architecture with no domain knowledge and no data pre-processing performs just as well as the one based on trend and seasonality modelling ideas. This was when N-Beats was officially born.
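Under the same interface, such a generic block might look like the sketch below; the hidden width and depth are illustrative. The point is that the backcast and forecast come from plain fully connected layers, with no built-in notion of trend or seasonality.

```python
import torch.nn as nn

class GenericBlock(nn.Module):
    """Fully connected block with no trend/seasonality prior: the network learns
    the backcast and forecast mappings directly from data."""
    def __init__(self, lookback, horizon, hidden=512, depth=4):
        super().__init__()
        layers, size = [], lookback
        for _ in range(depth):
            layers += [nn.Linear(size, hidden), nn.ReLU()]
            size = hidden
        self.body = nn.Sequential(*layers)
        self.to_backcast = nn.Linear(hidden, lookback)  # the part of the input this block explains
        self.to_forecast = nn.Linear(hidden, horizon)   # this block's contribution to the forecast

    def forward(self, x):                               # x: (batch, lookback)
        h = self.body(x)
        return x - self.to_backcast(h), self.to_forecast(h)
```

Stacking several of these generic blocks with the residual scheme above gives the kind of unconstrained, fully connected architecture described here.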

Delighting our internal users first

Element AI is a large collaborative environment made up of more than 100 PhDs and AI practitioners. We work in small teams and on independent projects, all of which rely on ready access to our cluster. Without a GPU scheduling tool, usage rights were often a point of contention among researchers: who was allowed priority access to the cluster, and why? With Orkestrator, GPU time was allocated on a fairer basis, allowing teams to avoid the politics of persuading gatekeepers that their project was the most important, the most urgent, the most exciting and therefore the one to be prioritized over all others.

It also gives AI practitioners and IT administrators a bird’s-eye view of the jobs running on the cluster, so monitoring a large-scale experiment becomes a lot easier than it would be if users had to figure out manually whether their jobs were running correctly. This turns into a huge time-saver. In our case, without a GPU scheduling tool, we would have spent most of our time monitoring and orchestrating the jobs we were running. Instead, most of the process was entirely automated, freeing up our time to focus on scientific questions.

In the end, the results of thousands and thousands of experiments established N-Beats as the state-of-the-art in univariate time-series forecasting tasks. It wasn’t just better; it was also conceptually simpler and faster to train than other leading approaches, taking only about an hour to train on 100,000 time series on our GPU cluster.

But it doesn’t end there. Our recent paper shows that N-Beats can be trained on one dataset, then applied to a different one, even one that does not contain time series from the training set, and still outperform most popular statistical methods. For example, N-Beats trained on the M4 dataset can be applied to time series from different domains, such as tourism statistics or highway lane occupancy.

N-Beats, as it turns out, is an implementation of a meta-learning algorithm; in other words, it is an algorithm that is able to learn how to learn, and it can therefore adapt itself and make good predictions even on datasets with few time series that it has never observed before. This constitutes an important contribution to machine learning algorithms. And it’s something we could not have achieved without enormous, highly optimized computing resources and the power to test our hypotheses at scale.

Our AI tools are what made these advancements possible. They reinforce best practices and support collaboration across groups. They abstract some of the engineering tasks related to running jobs on GPUs. And they help run more experiment trials in less time, accelerating the training and testing of models and enabling the kind of quick iteration and innovation at scale that we were able to achieve in the development of N-Beats.