Pong – a path entropy inspired solution

by | Sep 18, 2018

Most of the machine learning tasks are using a concept of “error” (also called loss in deep learning or fitness in genetic programming or reward in reinforcement learning) in order to quantify the difference between the expected and actual performance of an AI system. For instance, if an artificial neural network \(F\) is expected to output 1 for the input \({ i }_{ 1 }\)  and 0 for input \({ i }_{ 2 }\), the error \(e\) over the 2 samples can be defined using the L1 norm: \( e=\left| F\left( { i }_{ 1 }-1 \right) \right| +\left| F\left( { i }_{ 2 }-0 \right) \right| \). This so-called “supervised learning” approach is used in the vast majority of deep learning systems and has proven to be very effective. Nevertheless, it has one major drawback: it requires to explicitly provide to the system a number of examples where the input and output are known. As a result, in the field of deep learning, this requirement has led to the creation of an entire new industry and new job profile (“AI trainer”) only dedicated to the generation of those training data.

In practice, there are many cases where this training data is unavailable, for instance, how do you define the training data for a “thinking machine”? One approach consists in using a more general behavioral metric. For example, consider the case of the Pong game where the intelligent agent must control a bouncing ball with a pad. The traditional approach would be to train a neural network with a number of examples of input images of the pad game and the corresponding move of the pad. A less supervised approach was proposed by Wissner Gross. He says that an intelligent behavior consists in maximizing its future options. For this, he introduces the concept of “path entropy” which in this case would translate into “At one time step, how should I move the pad so that I have the most options for my next pad move?”. In practice this is not easy to compute, so we used an approximation which consists in saying that the pad should move in such a way that the ball is equally present over the entire pong game. We are actually replacing the path-entropy by a pure entropy \(E\) which can be easily computed using the ball probability presence \(P\) at the location \(\left( { x }_{ n }^{ t },{ y }_{ n }^{ t } \right)\) for the bin \(n\) over \(t\) time steps using \(E=\sum _{ n,t }^{ }{ P\left( { x }_{ n }^{ t },{ y }_{ n }^{ t } \right) } \times \ln { \left( P\left( { x }_{ n }^{ t },{ y }_{ n }^{ t } \right) \right) } \).

Now that we have a non-supervised (or at least “less-supervised”) performance metric, we still need to define an “engine” for our smart pong player. Based on our experience in this field, we chose to use a Linear Genetic Programming approach. With this approach, the source code of the software used to move the pad is written automatically using operators inspired by genetic selection. The software has access to the position of the ball and to the position of the pad, it returns the displacement of the pad. The source code (so-called “individual”) is generated using an intermediate language and converted in x64 machine code using our custom-made real-time compiler. We used individuals of up to 300 lines, with 256 populations (one per core) of 50 individuals, each population running on a HP BladeCenter system populated with 16 Proliant blades. The source code was evolved using mutation, crossover and migration operators. After 10 millions assessed individuals, we ended up with a fascinating solution, shown in the animation. After simplification of the source code, it corresponds to moving the pad at each time step according to the equation:

\(\Delta p=\frac { 1 }{ x } +2-\frac { p+y }{ x } +x-p-y\),

where \(\Delta p\) is the displacement to apply to the pad, \(\left( x,y \right) \) is the position of the ball and \(p\) is the position of the pad. Even knowing this equation, it is not obvious to understand why this solution works. By the way, this complexity demonstrates the power of using machine learning in general and genetic programming in particular. Indeed, it enables to develop system which complexity exceeds our understanding capabilities.

To conclude, we successfully solved Pong without explicitly indicating to the system that the pad should not miss the ball. Instead, we used the variability of the ball position across the board as the sole indication of performance. It is an interesting result since this approach can be used each time that it is impossible to express numerically a performance, which is typically the case of any high-level cognitive activity.