• Sumit Sanyal

The commercial reality behind DeepMind’s paper: “Reward is enough”

The DeepMind paper “Reward is Enough” [1] is making waves in the Artificial Intelligence community, eliciting both high praise and strong skepticism. At minds.ai, we have spent two years using our DeepSim [2] platform to make “reward maximization” based algorithmic solutions (controllers, designs and schedulers) a reality for our enterprise customers. We have trained agents that meet or exceed existing state-of-the-art performance benchmarks across a diverse range of applications.

Based on the experience we have gathered doing the hard work of training agents that can function in a complex and noisy real world (as opposed to the perfect scenario generation available in gaming), we offer the following commentary on Silver and Sutton’s excellent paper.


The (Grand) Premise

The audacity of the paper is reflected in the words left out of the title: “... for General Intelligence”. Those words get straight to the heart of the controversy and hype surrounding the field of modern AI (whose beginning is marked by the publication of Alex Krizhevsky’s paper [3]). Even so, we like where the authors are going with this.


With its startling lack of mathematical equations and the remarkable number of scientific disciplines covered (mathematics, game theory, computer science, biology, evolution, neuroscience, linguistics and human relations), the paper follows the hallowed tradition of other paradigm-shifting works such as Schroedinger’s paper on the physics of Life [4] and Hinton’s recent musings on part-whole hierarchies [5].


The lack of equations is beguiling though, since the paper is chock-full of references that are extremely technical in nature. In fact, the paper more than serves its purpose as a review of the significant publications by the pioneers of the field, and it brings them all together in a surprisingly coherent narrative.


Business KPIs = Simple Reward = RL commercialization

The AlphaGo discussion about singular vs composite reward functions is particularly insightful. As we work with our customers to solve real world problems using Reinforcement Learning, we are continually amazed by the efficacy of the following simple methodology:


  1. Work with the domain expertise that our customers have, to formulate their business level objectives in terms of a simple and high level reward function. We echo the insights provided by the authors:

  2. Express the reward at the highest and most comprehensive level, e.g. maximize energy production (for power plants), reduce energy consumption (for vehicles and drones), increase capacity utilization, maximize passenger comfort etc. Avoid the tendency to “help the algorithm” by decomposing the reward into subgoals. Doing so limits the creativity of the search algorithm by needlessly restricting the search space to the solutions that human brains can intuit. In our opinion such decomposition negates some of the most important benefits of the RL methodology.

  3. Whenever possible, express the reward integrated over time [*], e.g. one reward at the end of the drive, or a plant’s production over days, months or even years. This lets the agent discover novel time-varying strategies for dealing with the ever-changing nature of most complex systems and, for example, make efficient capex vs opex tradeoffs.

  4. When dealing with multiple interacting systems, try to express the reward in terms of the combined system’s performance. This enables the agent to discover integrated control techniques. Such integration has been a problem with classical control system techniques.

  5. Formulate the optimization problem stated in the reward as an RL problem and do millions of training steps using suitably representative simulations (in other words, let the RL algorithms do their magic using DeepSim).

  6. Deploy in the field and enable ongoing upgrades with off-policy learning.
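
As a rough illustration of steps 2 and 3, the sketch below pays out a single, undecomposed reward at the end of an episode instead of rewarding individual sub-goals. It is a minimal sketch under stated assumptions, not the DeepSim implementation: the `EpisodeRewardTracker` class, the `run_episode` helper and the energy-consumption KPI are hypothetical names chosen for this example.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class EpisodeRewardTracker:
    """Accumulates a business-level KPI and pays it out once, at episode end.

    Hypothetical helper for illustration only.
    """
    energy_used_kwh: float = 0.0

    def step(self, energy_kwh: float, done: bool) -> float:
        # Record this step's contribution to the KPI, but do not reward it yet.
        self.energy_used_kwh += energy_kwh
        if not done:
            return 0.0
        # One reward at the end of the drive: the negative total consumption,
        # so maximizing reward means minimizing the energy used overall.
        return -self.energy_used_kwh


def run_episode(energy_per_step: List[float]) -> float:
    """Return the undiscounted episode return for a list of per-step energy draws."""
    tracker = EpisodeRewardTracker()
    episode_return = 0.0
    last = len(energy_per_step) - 1
    for i, energy in enumerate(energy_per_step):
        episode_return += tracker.step(energy, done=(i == last))
    return episode_return
```

Because the reward is only defined over the whole episode, the agent is free to spend more energy early if that lowers the total, which is the kind of time-varying strategy a per-step decomposition could rule out.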





We have successfully applied the above methodology to beat performance baselines established by decades of control theory research, and we have watched in amazement as our RL agents self-discover, and then surpass, techniques such as:

  • 1st and 2nd order feedback controllers for nonlinear, dynamic systems (e.g. PID controllers)

  • Spatial filtering

  • Predictive models analogous to Kalman filters

  • Etc.
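
For readers unfamiliar with the classical baselines in this list, here is a minimal sketch of a discrete PID controller driving a simple first-order plant. The gains, the time step, the plant model and the `simulate` helper are assumptions chosen for this example; they do not describe any customer system or benchmark.

```python
class PID:
    """Textbook discrete PID controller (illustrative gains, not tuned for any real system)."""

    def __init__(self, kp: float, ki: float, kd: float, dt: float) -> None:
        self.kp, self.ki, self.kd, self.dt = kp, ki, kd, dt
        self.integral = 0.0
        self.prev_error = 0.0

    def update(self, setpoint: float, measurement: float) -> float:
        error = setpoint - measurement
        # Integral term accumulates past error; derivative term reacts to its change.
        self.integral += error * self.dt
        derivative = (error - self.prev_error) / self.dt
        self.prev_error = error
        return self.kp * error + self.ki * self.integral + self.kd * derivative


def simulate(steps: int = 400, setpoint: float = 1.0) -> float:
    """Drive the assumed plant dx/dt = -x + u toward the setpoint; return the final state."""
    pid = PID(kp=2.0, ki=1.0, kd=0.1, dt=0.05)
    x = 0.0
    for _ in range(steps):
        u = pid.update(setpoint, x)
        # Forward-Euler integration of the first-order plant.
        x += (-x + u) * pid.dt
    return x
```

A hand-designed controller like this fixes the structure of the control law in advance. An RL agent trained on a high-level reward has no such constraint, which is why it can rediscover this behaviour and, in some cases, go beyond it.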


[*] In practice the training process might have to be kickstarted using intermediate (or instantaneous) rewards. Also, in some cases, time varying rewards might be needed for practical tasks.
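
One common way to kickstart training along these lines is to blend an intermediate reward into the terminal reward and fade it out as training progresses. The sketch below assumes a linear decay schedule; the function names and the schedule are illustrative choices, not a description of how DeepSim handles this.

```python
def shaping_weight(training_step: int, anneal_steps: int) -> float:
    """Linearly decay the weight on the intermediate reward from 1.0 to 0.0."""
    if anneal_steps <= 0:
        return 0.0
    return max(0.0, 1.0 - training_step / anneal_steps)


def blended_reward(
    terminal_reward: float,
    intermediate_reward: float,
    training_step: int,
    anneal_steps: int,
) -> float:
    """Terminal reward plus a fading contribution from the intermediate reward."""
    weight = shaping_weight(training_step, anneal_steps)
    return terminal_reward + weight * intermediate_reward
```

Once `training_step` reaches `anneal_steps`, the agent is optimizing the high-level reward alone, so the intermediate signal guides early exploration without constraining the final policy.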


Sample Efficiency

The paper correctly points out that evolution optimizes for better sampling efficiency in biological systems and notes it as a challenge for today’s AI implementations. That’s a challenge we are taking on with DeepSim.


When it comes to compute resources for RL training, bigger and faster almost always equates to better performance and shorter development times. In publications touting the success of RL algorithms, it has become customary to brag about the scale and amount of compute resources used to achieve the desired results. The version of AlphaGo that beat Lee Sedol used 1,920 CPUs and 280 GPUs [6]. We routinely require thousands of cores on our DeepSim platform to perform parallel training runs that keep training times within 6 to 12 hours for experimental runs. Final production runs take days to complete. Now that cloud vendors have started offering the ability to provision “supercomputing” scale clusters on the fly, we expect RL to be democratized. In building DeepSim, we have collaborated with the Azure HPC team to take seamless advantage of on-demand scaling features and bring the best of HPC to our customers.


The great news is that Moore’s law seems to be alive and well at the data center scale, with the hyperscalers continuing to drive down the cost of cloud computing year by year. Nevertheless, we predict that the computational demands of RL training in the cloud will keep growing due to the following trends:

  • Simulator fidelity has finally narrowed the sim-to-real gap to the point that the RL methodology has become commercially viable. We expect this trend to continue, requiring greater amounts of cloud computing resources used in novel ways. Heterogeneous clusters comprising CPUs and GPUs connected by InfiniBand are available on the Azure HPC platform, and DeepSim is designed to use these resources effectively as needed for each training job.

  • Ever increasing complexity of autonomous systems

  • The proliferation of intelligent agents in every facet of information technology (we believe this is the most practical path to Karpathy’s vision of “Software 2.0”).

  • Continuous improvement of deployed agents using off-policy learning.


This growth in computational requirements has to be met with better sampling efficiency in RL training. In the enterprise world, we understand that commercial deployment will depend more and more critically on increasing sampling efficiency. With our DeepSim platform we are implementing various techniques to decrease the cost of training in terms of hours, dollars and joules. Our proprietary IP is based on some of the following techniques:

  • Curriculum learning

  • Intelligent sampling of scenarios

  • Intelligent selection of hyper-parameters such as batch size, learning rates, number of workers etc.

  • Intelligent choice of computing resources

  • Reusing learned skills across various reward functions (since skills are largely dependent on the environment and independent of reward function)

  • Creating more efficient Neural Network architectures

  • Feature engineering using the help of domain experts

  • etc.
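
To make “intelligent sampling of scenarios” concrete, the sketch below shows one generic approach: draw training scenarios more often when the agent still fails them. This is an illustration of the general idea only, not the proprietary DeepSim technique. The function names, the scenario names in the usage note, and the `floor` parameter are all assumptions made for this example.

```python
import random
from typing import Dict, List


def scenario_weights(success_rate: Dict[str, float], floor: float = 0.05) -> Dict[str, float]:
    """Weight each scenario by how often the agent still fails it.

    The floor keeps mastered scenarios in the mix so they are not forgotten.
    """
    raw = {name: max(floor, 1.0 - rate) for name, rate in success_rate.items()}
    total = sum(raw.values())
    return {name: weight / total for name, weight in raw.items()}


def sample_scenarios(success_rate: Dict[str, float], n: int, seed: int = 0) -> List[str]:
    """Draw n scenario names for the next training batch, favouring the hard ones."""
    rng = random.Random(seed)
    weights = scenario_weights(success_rate)
    names = list(weights)
    return rng.choices(names, weights=[weights[name] for name in names], k=n)
```

For example, calling `sample_scenarios({"highway": 0.9, "icy_road": 0.2}, n=64)` would spend most of the batch on the hypothetical `icy_road` scenario, so fewer simulation steps go to situations the agent already handles well.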


In conclusion, we do not know if reward maximization algorithms can or will lead to General Intelligence, but we are confident that the proliferation of RL agents will profoundly impact the way that software is created and deployed. After all, given the implications of the strong interpretation of the Church-Turing conjecture, the Software 2.0 trend is just another approach to AGI. This would suggest that the grand claim of the authors is indeed correct.



References

[1] https://deepmind.com/research/publications/Reward-is-Enough

[2] http://deepsim.ai

[3] https://www.cs.toronto.edu/~fritz/absps/imagenet.pdf

[4] https://arxiv.org/abs/2102.04842

[5] https://arxiv.org/abs/2102.12627

[6] https://en.wikipedia.org/wiki/AlphaGo