So, after actually doing the experiment rigorously, I found two really important, completely unexpected results which I hadn't even noticed in the first, seat-of-the-pants empirical testing.

Most importantly: the networks with a mix of learning rates are far less likely to get stuck on local peaks in the fitness landscape, and far more likely to find a global solution.

Artificial Neural Networks are kind of like a blind mountain climber; they can tell which way is "uphill" where they are, and they can proceed until they can't find "uphill" in any direction, but they don't know whether they've climbed the tallest mountain in the area; only that they're at the top of something.
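To make the blind-climber picture concrete, here is a minimal sketch in Python. The two-hill `fitness` function, the step size, and the starting points are all made up for illustration; they are not from my experiments. The climber only ever feels the local slope, so where it ends up depends entirely on where it starts.

```python
import math

def fitness(x):
    # Hypothetical 1-D landscape: a short hill near x = -1, a taller one near x = 2.
    return math.exp(-(x + 1.0) ** 2) + 2.0 * math.exp(-(x - 2.0) ** 2)

def slope(x, h=1e-5):
    # The climber "feels" uphill by probing a tiny step to either side.
    return (fitness(x + h) - fitness(x - h)) / (2.0 * h)

def climb(x, step=0.05, tol=1e-6, max_steps=100_000):
    # Keep stepping uphill until no direction is noticeably uphill any more.
    for _ in range(max_steps):
        g = slope(x)
        if abs(g) < tol:
            break
        x += step * g
    return x

low = climb(-1.5)   # starts on the flank of the short hill
high = climb(1.0)   # starts on the flank of the tall hill

print("from -1.5:", low, fitness(low))
print("from  1.0:", high, fitness(high))
```

Both runs stop at a point where the slope is essentially zero, and neither run has any way to know whether the other hill exists.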

When the very slow-learning weights are considered as constants, a fitness landscape is revealed in which there are certain directions you can't move, but you can find a highest point in the directions you can move - at least a local peak and possibly a global peak. A global peak is defined as the behavior of the network matching the ideal - which may be achievable with many different combinations of weights, in which case you have many global peaks all at the same height.

If it's a global peak, then movement of the slow-learning weights won't reveal anything you haven't already achieved. But if it turns out that it's only a local peak, then movement of the slow-moving weights gradually reveals a whole new fitness landscape - in which the altitude you've already reached is never reduced, and the fast-moving weights are likely to discover a new path for rapid upward climbing.

Rinse, repeat. In order to truly be a local fitness peak, a given achievable behavior must be a local peak in every one of a long succession of different fitness landscapes. That is different from being the first peak found in a single fitness landscape of higher dimensionality, because in the process of successively converging on newly-revealed "local peaks" the fast-moving weights do a heck of a lot of exploring of the solution space that otherwise wouldn't happen, and in a highly nonlinear space, that truly matters.
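Here is a toy sketch of the mechanics, in Python. It is not my experimental setup and it doesn't reproduce the results; it's a two-weight landscape I made up so the effect is easy to see. The names `fitness`, `gradient`, and `train`, the shape of the landscape, and the two learning rates are all assumptions for illustration. One weight learns fast, one learns slowly. With the slow weight frozen, the fast weight settles on the lower of two peaks in its slice. With the slow weight allowed to drift, the slice gradually tilts until that lower peak stops being a peak, and the fast weight runs up to the taller one.

```python
def fitness(fast, slow):
    # Hypothetical landscape: for a fixed `slow`, the slice in `fast` has two
    # peaks near -1 and +1; `slow` tilts the slice toward the right-hand peak.
    return -(fast ** 2 - 1.0) ** 2 + slow * fast - 0.5 * (slow - 3.0) ** 2

def gradient(fast, slow):
    # Partial derivatives of `fitness` with respect to each weight.
    d_fast = -4.0 * fast * (fast ** 2 - 1.0) + slow
    d_slow = fast - (slow - 3.0)
    return d_fast, d_slow

def train(fast, slow, rate_fast, rate_slow, steps=20_000):
    # Plain gradient ascent, except each weight has its own learning rate.
    for _ in range(steps):
        d_fast, d_slow = gradient(fast, slow)
        fast += rate_fast * d_fast
        slow += rate_slow * d_slow
    return fast, slow

# Slow weight treated as a constant: the climber is confined to one slice.
frozen = train(-0.8, 0.0, rate_fast=0.05, rate_slow=0.0)

# Slow weight learning at 1/50th the rate of the fast one.
mixed = train(-0.8, 0.0, rate_fast=0.05, rate_slow=0.001)

print("frozen:", frozen, fitness(*frozen))
print("mixed: ", mixed, fitness(*mixed))
```

In this toy, ordinary gradient ascent with one shared learning rate would also get off the lower peak, so it says nothing about mixed rates beating uniform ones. All it shows is the part of the argument about slices: a peak in the landscape you see with the slow weights held constant need not survive once they start to move.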

Second in importance: they do it without needing conjugate-gradient methods or Levenberg-Marquardt, which are really awesome methods but intractable on very large networks. Levenberg-Marquardt has to build and solve against an approximate second-derivative matrix whose size goes as the square of the number of weights in the network, and conjugate-gradient, though it gets by on gradients alone, pays for a careful line search on every step. They give you networks of awesome efficiency for their size, but... they limit the size, and therefore the complexity of problems you can attack. Badly.

So... I'm a bit boggled by this whole thing. I'm more than a bit boggled by my experimental results. We can use smaller networks now and still find good optima, which speeds the whole thing up and, once the configurations are discovered, drastically reduces the amount of compute power required by the resulting networks. Of course, finding better optima also means we're that much more likely to overfit our training data. And we can use bigger networks and have benefits like those of the second-order methods, without paying impossible prices in compute time for them.

And I'm going heads-down again, finding cites and preparing graphics and looking at the conference schedules. And also going into mad-scientist mode and having a crack at consciousness.... Not with an expectation of success, but with the knowledge that I just kicked a few of the long-standing obstacles out of the way. Now we can get past them and see what other obstacles there are....
