Finding insights in complex data often requires searching for the maximum of a quantity that depends on many parameters and variables. In a simple one-dimensional case this can be handled with standard optimization routines. This post will not delve too deeply into the enormous field of mathematical optimization, but aims to give just enough background to understand the motivation for the newer methods discussed at the end of the article. After a brief overview of optimization algorithms, we will turn our attention to an interesting method developed by Rémi Munos, a researcher at Google DeepMind. It is a particularly useful algorithm when the data becomes extremely complex and the cost of sampling it is large. That cost can be time, money, or any other resource. The important point is that this branch of machine learning is concerned with minimizing the expenditure of that resource while intelligently searching for the desired result. Because the algorithm adapts to the data as it is received, the approach is referred to as adaptive sampling.
Application to Real-World Problems
The above introduction is all very interesting, but if you’re still here, I imagine you are wondering why you should care about the abstract mathematics of function optimization. It is obviously important in scientific research, economic modelling, and so on, but it can be less obvious how it applies to real-world problems. Imagine you are part of a Formula One racing team and you need to adjust the gear ratios to get the perfect blend of acceleration and top speed. This is an optimization problem: there are variables you can change that affect the acceleration and the top speed according to certain rules. If you set the gear ratios to the same position twice, you will get the same engine performance, which means the problem has a deterministic solution.
What if you are an SEO agency with data coming in from all your clients? There is a pattern in that data. There is a correlation between the traffic your clients’ sites receive and the changes made on those sites, the work performed on attaining back-links, and your AdWords campaigns. Highly experienced people seem to know instinctively what will work, to the point that it can seem like black magic. With advanced data-wrangling techniques the veil can be lifted from these dark arts and meaningful, intuitive insights can be gleaned.
Numeric vs. Analytic Optimization
Optimization routines generally fall into two categories: analytical and numerical.
Analytical methods essentially mean sitting down with a known or assumed function and using calculus to solve the equation mathematically, finding the maximum or minimum as required. Of course, this can be performed programmatically too, with homebrewed software or an off-the-shelf package like Mathematica. The approach requires the function to be simple enough, and suitably formed, for standard calculus methods to solve. For many functions, however, the time involved is so long that it isn’t practical, or it is simply impossible to solve them this way.
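As a toy illustration of the analytical approach (the function and names here are my own, not from the article), the maximum of a downward-opening quadratic f(x) = ax² + bx + c can be found in closed form by setting the derivative 2ax + b to zero:

```python
def quadratic_maximum(a, b, c):
    """Closed-form maximum of f(x) = a*x**2 + b*x + c (requires a < 0).

    Setting the derivative f'(x) = 2*a*x + b to zero gives x = -b / (2*a).
    """
    if a >= 0:
        raise ValueError("a must be negative for a maximum to exist")
    x = -b / (2 * a)
    return x, a * x * x + b * x + c

# f(x) = -x**2 + 6*x - 4 peaks at x = 3 with value 5
print(quadratic_maximum(-1, 6, -4))  # → (3.0, 5.0)
```

No sampling is needed at all: the answer comes straight from the formula, which is exactly why analytical solutions are preferred whenever the function permits one.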
Numerical methods are used when analytical methods aren’t practical or possible. A simple way to think about them is that the problem is solved by numerically simulating it: an algorithm produces successive approximations that are iterated until a solution is converged upon. Numerical solutions are what we will deal with for the rest of this article. The advent of computers has transformed this type of analysis, since millions of iterations of a problem can be performed each second. There are still many pitfalls to these techniques, as we will see in the next sections.
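By contrast, a numerical routine knows nothing about the function’s form and simply evaluates it repeatedly. A minimal sketch (the function and interval are my own example, not from the article) is a ternary search, which iteratively narrows an interval around the maximum of a single-peaked function:

```python
def ternary_search_max(f, lo, hi, tol=1e-6):
    """Numerically locate the maximum of a unimodal function on [lo, hi].

    Each iteration discards a third of the interval, so the bracket
    shrinks geometrically until it is narrower than tol.
    """
    while hi - lo > tol:
        m1 = lo + (hi - lo) / 3
        m2 = hi - (hi - lo) / 3
        if f(m1) < f(m2):
            lo = m1      # the maximum cannot lie in [lo, m1]
        else:
            hi = m2      # the maximum cannot lie in [m2, hi]
    return (lo + hi) / 2

x_max = ternary_search_max(lambda x: -(x - 3) ** 2 + 5, 0, 10)
```

Note the trade-off: the routine never "understands" the function, it just converges by brute repetition, and it only works here because the function has a single peak.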
Local vs Global Maximum
A major problem with many optimization routines is that they readily find a maximum or minimum, but one that may only be locally true. Hill climbing is one of the simplest algorithms: an initial guess is made, the function is sampled there, and then it is sampled again a short distance in x to either side of the first spot. The values are compared and the position of the largest is taken as the next point. The process repeats, each time moving to the neighbouring point with the largest value, until a point is found that has lower values to either side.
In the image below a function with several maxima and minima is shown. The first step in the hill-climbing algorithm is shown in figure 1, where an initial guess is chosen to start the iterations at point A. From this point, depending on which way the solver starts travelling, either of the two peaks B or C next to the initial trough can be found, as shown in figures 2 and 3. As can easily be seen, these peaks are just local maxima and are far away from the actual global maximum, highlighted at point D.
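The climbing procedure described above can be sketched in a few lines (a minimal illustration; the two-peaked function, start point, and step size are invented). Started in a trough like point A, the climber stops on the nearest peak and never sees the taller one:

```python
import math

def hill_climb(f, x0, step=0.1, max_iters=10_000):
    """Climb from x0 to the nearest local maximum of f on a fixed grid."""
    x = x0
    for _ in range(max_iters):
        left, here, right = f(x - step), f(x), f(x + step)
        if left <= here >= right:
            break                      # neither neighbour is higher: a local peak
        x = x - step if left > right else x + step
    return x

# Two peaks: a local one near x = 1 and the taller, global one near x = 5.
f = lambda x: math.exp(-(x - 1) ** 2) + 2 * math.exp(-(x - 5) ** 2)
x_found = hill_climb(f, x0=0.0)   # stops at the local peak near 1, missing 5
```

Running this from x0 = 0.0 converges to roughly x = 1, even though the function is twice as large near x = 5, which is exactly the failure mode the figure illustrates.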
Advanced Optimization Routines
There are a number of algorithms that address the problem of local extremes not being the full solution. Even the simple hill-climbing algorithm can include random jumps to kick itself out of local extremes, but it is still not very efficient. More advanced approaches include simulated annealing, genetic algorithms, stochastic tunnelling, and many others. However, as the number of variables or the size of the dataset to be sampled grows very large, these routines begin to struggle as well.
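One of the simplest fixes is to restart a hill climber from random positions and keep the best result. A toy sketch (the function, range, restart count, and seed are all invented for illustration):

```python
import math
import random

def hill_climb(f, x0, step=0.1, max_iters=10_000):
    """Climb from x0 to the nearest local maximum of f on a fixed grid."""
    x = x0
    for _ in range(max_iters):
        left, here, right = f(x - step), f(x), f(x + step)
        if left <= here >= right:
            break
        x = x - step if left > right else x + step
    return x

def restart_hill_climb(f, lo, hi, restarts=10, seed=0):
    """Run the climber from several random starts and keep the best peak found."""
    rng = random.Random(seed)
    return max((hill_climb(f, rng.uniform(lo, hi)) for _ in range(restarts)),
               key=f)

# Same two-peaked function as before; restarts rescue us from the local peak.
f = lambda x: math.exp(-(x - 1) ** 2) + 2 * math.exp(-(x - 5) ** 2)
best = restart_hill_climb(f, 0.0, 8.0)   # some start lands in the basin of 5
```

This works for small problems, but notice the cost: every restart spends a full climb’s worth of samples, which is exactly the expense that becomes unaffordable as the search space grows.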
Now let’s look at an example of a slightly more involved function. The image below shows a Schwefel function. It is displayed as a two-dimensional function, but it can be extended to an arbitrary number of dimensions. It is an absolute mess of peaks and troughs, and the processing power needed to find a global maximum in something like this is an order of magnitude larger than for a simple one-dimensional function with a couple of peaks.
While the Schwefel function is a known quantity, since it is defined by an equation we have access to, it makes a good test bed for any algorithms we develop. It is an intricate landscape of peaks and troughs, and if the algorithm has no knowledge of the underlying function, it is a good place to test how quickly a solution can be found, or rather how few samples need to be taken before a good approximation is reached, especially when the function is extended to many dimensions.
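For reference, the d-dimensional Schwefel function is usually written as a minimization benchmark (flip the sign to treat it as a maximization problem), with its optimum at xᵢ ≈ 420.9687 on the domain [-500, 500] in each dimension:

```python
import math

def schwefel(x):
    """Schwefel benchmark: f(x) = 418.9829*d - sum(x_i * sin(sqrt(|x_i|))).

    Defined on [-500, 500]^d, with global minimum ~0 at x_i = 420.9687.
    """
    d = len(x)
    return 418.9829 * d - sum(xi * math.sin(math.sqrt(abs(xi))) for xi in x)

print(schwefel([420.9687, 420.9687]))  # close to 0, the global optimum
```

The sin(sqrt(|x|)) term is what produces the dense grid of deceptive peaks: the second-best region is far from the true optimum, so greedy local methods are reliably led astray.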
The data generated from real-life applications can be very much like this. Large numbers of variables, combined with an unknown relation to the desired results, can cloud the data and make the right path to an optimal solution difficult to find. That is why machine learning and adaptive sampling are increasingly being used to solve such problems.
The majority of the adaptive sampling algorithms developed by Rémi Munos and his colleagues are concerned with stochastic functions. A stochastic process is a random process in which the value at a specific point can change randomly from moment to moment, within certain constraints. The prices of companies on the stock market are a good example of stochastic data.
The issue they investigated is framed as a long line of one-armed bandits (or fruit machines, depending on your continent), each with a certain chance of paying out that is unknown at the start. The driving force is to find the bandits with the highest odds of winning while spending the smallest amount of money in the process. Play is split between exploration, where new bandits are tested repeatedly to estimate their odds, and exploitation, where the bandits with the highest known odds are played over and over to win as much money as possible.
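The simplest strategy for trading off those two phases is ε-greedy: with probability ε pull a random arm (exploration), otherwise pull the arm with the best estimated payout so far (exploitation). This sketch only illustrates the bandit setting itself, not the specific algorithms from Munos’s work, and the payout probabilities are invented:

```python
import random

def epsilon_greedy(probs, steps, eps=0.1, seed=0):
    """Play a row of Bernoulli bandits; return the estimated payout odds per arm."""
    rng = random.Random(seed)
    counts = [0] * len(probs)
    estimates = [0.0] * len(probs)

    def pull(arm):
        reward = 1.0 if rng.random() < probs[arm] else 0.0
        counts[arm] += 1
        estimates[arm] += (reward - estimates[arm]) / counts[arm]  # running mean

    for arm in range(len(probs)):       # seed the estimates: try every arm once
        pull(arm)
    for _ in range(steps - len(probs)):
        if rng.random() < eps:
            pull(rng.randrange(len(probs)))                           # explore
        else:
            pull(max(range(len(probs)), key=lambda i: estimates[i]))  # exploit
    return estimates

estimates = epsilon_greedy([0.2, 0.8, 0.5], steps=5000)
best_arm = max(range(3), key=lambda i: estimates[i])   # should identify arm 1
```

The limitation of ε-greedy, and the motivation for the more refined bandit algorithms in this line of research, is that it keeps wasting a fixed fraction of pulls on arms it already knows are poor.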
Deterministic Optimistic Optimization
In the exhaustive paper he published on these adaptive sampling algorithms, The Optimistic Principle applied to Games, Optimization, and Planning: Towards Foundations of Monte-Carlo Tree Search, the stochastic algorithm was adapted to work with deterministic functions. Again, by deterministic we mean that the same set of variables gives the same outcome each time, in contrast to the stochastic scenario. With a deterministic function we are only concerned with the exploration phase and with avoiding becoming stuck in a local maximum. The method is called the Deterministic Optimistic Optimization (DOO) algorithm because it operates on deterministic functions, and it is optimistic in the sense that it treats the best value sampled so far as the most promising outcome and uses that to adapt its search.
This algorithm gives a great introduction to the way that the basic search is performed, as it is much more straightforward than the stochastic case. The interactive graph below allows you to play around with this.
To start we have a one-dimensional function over a known range in x. We have limited the demo to one dimension purely to make it easy to understand. When there are more dimensions, the algorithm rotates between expanding its search along the different dimensions so that all are explored equally.
The word dimension can be misleading, as it brings to mind searching across some area of space, but this isn’t really the case. To bring this back to the real world, let’s think about that Formula One car again. We still want to maximize the acceleration and top speed in a useful way. One dimension might be the gear ratios, a second dimension might be the fuel mixture ratios, a third dimension could be the angle of the spoiler changing the downforce on the rear wheels, and so on. Each sample would mean making the adjustments and then taking the car out for a lap of a racetrack. There is only a finite amount of time to do this in the week before the next race, so an intelligent search is necessary. Now bear with me; I understand this is an imperfect metaphor, because Formula One teams have most of this stuff pretty well figured out and aren’t just blindly stabbing in the dark at how to change the gear ratios.
In the demo, the algorithm has no idea that it will be sampling a Schwefel function. The demo starts with the function to be optimized showing, but this can be hidden by clicking the button, to simulate what the algorithm sees. The first step in the search is to divide the range into 3 nodes and take a sample at the centre of each node. These first nodes are denoted level 1, as they have only been split once. The nodes can be displayed by clicking the “show nodes” check box, and they will be shown the next time the “expand node” button is pressed.
Now the optimistic part comes in. The algorithm assumes that the largest of the three sampled values is its best chance of maximizing the output. Therefore, when the “expand node” button is pressed, it takes that node, splits it into 3 smaller nodes, and samples the centre of each. These new nodes are at depth 2, because two levels of splitting were needed to reach them.
The marker on the node depth chart at the top now sits underneath the level 2 box, indicating that we are looking for the largest-value node of depth 2 or lower. Hitting the “expand node” button once more will pick the largest value at depth 2 or below and expand that node.
Now the exploration phase kicks in. The node depth marker indicates that nodes of depth 1 are to be explored next, so the algorithm looks for the largest-value node at depth 1 and expands it. This is the part of the search that kicks it out of local maxima and ensures the whole range is investigated. It doesn’t take many more node expansions to find the true maximum and start exploring it in fine detail.
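Putting the pieces together, a minimal one-dimensional DOO can be sketched as follows. This is my own simplified reading of the algorithm: each leaf of the partition tree gets an optimistic upper bound, its sampled value plus a term proportional to its width (a Lipschitz-style bound), and the leaf with the highest bound is expanded into three children, much as the demo does. The test function, bound constant, and expansion budget are invented for illustration.

```python
def doo(f, lo, hi, expansions, bound_const=2.0):
    """Deterministic optimistic optimization on [lo, hi] by trisection.

    Each leaf is (lo, hi, centre, f(centre)). The leaf with the largest
    optimistic value f(centre) + bound_const * half_width is expanded.
    """
    centre = (lo + hi) / 2
    leaves = [(lo, hi, centre, f(centre))]
    best_val, best_x = leaves[0][3], centre

    def b_value(leaf):
        a, c, x, fx = leaf
        return fx + bound_const * (c - a) / 2   # optimistic upper bound

    for _ in range(expansions):
        leaf = max(leaves, key=b_value)         # most promising node wins
        leaves.remove(leaf)
        a, c, _, _ = leaf
        w = (c - a) / 3
        for k in range(3):                      # split into 3 child nodes
            child_lo, child_hi = a + k * w, a + (k + 1) * w
            x = (child_lo + child_hi) / 2
            fx = f(x)                           # middle child re-samples its centre
            leaves.append((child_lo, child_hi, x, fx))
            if fx > best_val:
                best_val, best_x = fx, x
    return best_x, best_val

# A smooth peak at x = 0.7; DOO homes in on it within a few dozen expansions.
x_best, v_best = doo(lambda x: -(x - 0.7) ** 2, 0.0, 1.0, expansions=60)
```

The width term is what drives the back-and-forth seen in the demo: shallow, wide nodes keep a large bonus even when their sampled value is mediocre, so the search periodically revisits them instead of drilling endlessly into one peak.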
Adaptive Sampling of a Schwefel Function
The power of these search algorithms is not in finding the global maximum of a one-dimensional function, but the simple case allows an intuitive understanding of the approach to be built up. The method becomes incredibly powerful when large data sources are investigated and each sample consumes significant resources. The search can be stopped at any time, once a certain amount of the search budget has been spent, and this method gives you the best chance of finding a good approximation of the global maximum for that resource expenditure.
When the stochastic elements are included, non-deterministic processes can be investigated intelligently. Processes like how many people click on a link in a day, or whether Google has changed its algorithm, are good examples of stochastic processes. With an appreciation of the new techniques being developed to analyze data and to search for new opportunities, a much more insightful use of your data can take place.