Darkcanuck/RRServer/Ratings

I'd like to open up a discussion on what ratings are meaningful in the rumble. There are various discussions scattered around the old wiki, but right now we have an opportunity to experiment with new things (on my server) and compare to existing results (ABC's server).

Here's what I've implemented on the new server and my reasons for doing so.

Average Percentage Score (APS)

This is probably the "purest" measure of a bot's performance. Under ideal conditions (i.e. full pairings and at least 20-50 battles per pairing to reduce variability), APS would allow an accurate comparison between all bots in the rumble. It's uncertain how well it works with less battles or incomplete pairings.

The server calculates APS for each bot by:

• taking the average percentage score of all battles against each opponent separately to get an APS for each pairing,
• then averaging all pairing scores to obtain the final average.

This differs from the old/current rumble server which weights newer scores higher than previous ones. The old system uses something like: `newaverage = 0.3 * newscore + 0.7 * oldaverage`. The reasons behind this weighting were to account for learning/data-saving bots improving in performance over time and to discard "bad" results. However, I believe that this actually increases score variability so I've chosen to use a true average which will smooth out results. Learning bots should still see their APS go up over time, but now there is no special advantage given to a bot which saves data between battles.

Glicko Rating System

Created by Mark Glickman, the system is described here: http://math.bu.edu/people/mg/glicko/glicko.doc/glicko.html The main difference from the Elo system is that each competitor has a ratings deviation (RD) in addition to their rating. New competitors start out with a high RD, which gradually drops as the rating settles. In Elo, winner and loser receive equal but opposite ratings adjustments after a battle, whereas in the Glicko system the adjustment is based on the RD value. A high RD results in a bigger adjustment, so new competitors are adjusted more quickly; established competitors ratings' should change more slowly.

The server implements the system to the letter, with the one exception that RD values in the rumble do not "decay" (increase) with inactivity. Scores start at 1500, RD values start at 350.

Glicko-2 Rating System

Also by Mark Glickman: http://math.bu.edu/people/mg/glicko/glicko2.doc/example.html This is an enhancement of the original Glicko system which adds a volatity measure to each score. The calculations are more involved but it should (in theory) provide stable ratings for even competitors whose performance is erratic (ie. high specialization index in rumble-speak).

Again the system has been implemented as described but without RD decay. Tau is set to 0.5 and the iterative volatility calculation has been bounded to at most 20 iterations.

RoboRumble Elo Ratings

I've tried to understand the old server's rating scheme and implement that system as faithfully as possible. Thanks mostly to the work of Nfwu and his commented [| EloSim] code, plus details scattered about the old wiki, I've pieced together the following:

• New competitors start at a rating of 1600
• The expected % score outcome of a pairing between bots A and B is given by E(A,B) = 1.0 / (1 + 20^(ratingA-ratingB)/800))
• When a new pairing result is submitted to the server:
1. The new % score for bot A vs bot B `New(A,B) = scoreA / (scoreA + scoreB)`
2. The running pairing score of A vs B `Pair(A,B)' = 0.7 * Pair(A,B) + 0.3 * New(A,B)`
3. Then calculate the rating change for A by iterating over all ranked bots Ri:
• `deltaRatingA += 3.0 * (Pair(A,Ri) - E(A,Ri)`
4. Do the same for B
5. Update the ratings for A and B by adding the new delta to their current rating.

As mentioned above, this server uses the true average for Pair(A,B); otherwise the calculations are the same.

Calculation Notes

One of Albert's most interesting changes to the standard Elo calculation is that each new battle result requires updating a bot's rating by iterating over the whole participant list, not just the one involved in the battle. I have no idea how this might affect rankings inflation/deflation, but I think it contributes significantly to stabilizing the ratings faster since every time a new result comes in, we re-calculate that bot's rating compared to the entire field.

This is computationally expensive, but since I had to put it in for Elo, it was easy to do the same for Glicko and Glicko-2. Thus all three ratings systems are using the same method, just different calculations for the expected and ratings delta values.

Based on what I've read about the three systems, Glicko-2 would probably yield the fastest-stabilizing results, but that still needs to be determined by comparing the three side-by-side. It's quite possible that Elo is "good enough", thanks to Albert's modification.

Alternative Systems

Recently, MN has proposed a batch ELO rating based on pairing win rates (i.e. # wins/battles for each pairing). This has led to a few interesting discussions. I've done a bit of research on alternate ratings and put together a comparison table on the rumble server, using a database backup from 8-Aug-2011: http://darkcanuck.net/rumble/altrankings.html

The new systems tried were:

• MN's batch ELO ratings.
• Schulze (Condorcet) voting method, using pairing APS as the vote tallies.
• Tideman Ranked Pairs (also Condorcet) voting method, also using pairing APS.

Note that MN's system produces results very similar to the current PL rankings (ignoring ties). The two Condorcet methods produce some very interesting results, although the reason why some bots get rated much higher under these systems is not very intuitive. Rankings might be slightly different if win% was used instead of APS, but I think APS is a better measure and leads to less ties. All three systems have the disadvantage of requiring offline (batch) calculation with significant memory and CPU usage.

Personally, I still prefer the APS and PL rankings because they are both intuitive (easy to understand the rankings, what needs to be improved to move up the ladder), precise (no ambiguity or possibility of drift) and simple to calculate online. The two systems are also complementary, APS measuring scoring perfection, PL measuring pure winning ability. I should also point out that PL is actually a Condorcet method (Copeland).

Further investigation into Borda (another voting method) might be interesting, since it is a simple calculation and considered to be an excellent voting method where there is no strategic voting. I'm not sure how different it may end up being from PL, however.

I've also found an enhancement on traditional ratings for multiplayer environments which promises to be better than ELO/Glicko for melee: http://jmlr.csail.mit.edu/papers/volume12/weng11a/weng11a.pdf

Comments, discussion welcome on the talk page!