Introduction
Since this is still going on, I think it would be beneficial to break this down statistically. I started this as a reply but it reached sufficient length that I decided it deserved its own thread. The link to the original thread is here.
Max in the Shops Mirror
The model that best approximates a 'coin-flip' scenario - two outcomes determined by luck, each with probability p, repeated n times - is the binomial distribution. Let's apply this to Max's experience with the mirror. The parameters are n = 16 games and p = 50%, or 0.5 (since it is a 'mirror'). Ignoring skill, Max should win
Mean (mu) = n * p = 16 * 0.5 = 8 games
Let's stop and double-check that the result makes sense: flip 16 coins and on average there will be 8 heads and 8 tails. Moving on to the variance around that mean, we use the equation
Var (sigma^2) = n * p * (1 - p) = 16 * 0.5 * (1 - 0.5) = 4.0
The standard deviation is typically more useful than the variance and is found by taking its square root:
Std Dev (sigma) = sqrt(4) = 2 games
The final breakdown for an average pilot over 16 mirror matches: a mean of 8 wins, a standard deviation of 2, and expected ranges of 6-10 wins (one standard deviation), 4-12 wins (two), and 2-14 wins (three).
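For anyone who wants to check the arithmetic, here is a minimal Python sketch of the same calculation; the only inputs are the n and p above.

```python
import math

n, p = 16, 0.5  # 16 mirror matches, assumed 50-50 per the text above

mean = n * p                   # expected wins: 8.0
variance = n * p * (1 - p)     # 4.0
std_dev = math.sqrt(variance)  # 2.0

# 1, 2, and 3 standard deviation ranges around the mean
for k in (1, 2, 3):
    low, high = mean - k * std_dev, mean + k * std_dev
    print(f"+/- {k} std dev: {low:.0f} to {high:.0f} wins")
```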
Max's 15 wins in 16 matches is far above the 8 wins an average Shops player would be expected to post. Therefore, it would be pretty reasonable for a casual observer to say that Max is an above-average Shops pilot based on these results.
Max against the Field
There are two ways to establish the probability used in these models. The first is to derive it theoretically, as we did in the first section. In the mirror the cards are assumed to be the same or close to the same, so if player skill is ignored, the theoretical probability of winning equals the probability of losing: 50-50. However, if the cards are substantially different, i.e., the players are playing different decks, it is much more tenuous to assume a 50-50 win-loss split. You can still make an argument for it: the tournament structure is such that a loss for one player is a win for another, so the overall record of the field must be 50%. If you do so and exclude Max's Shops matchups, you get an expected 42 wins out of his 84 matches against the field, with a standard deviation of sqrt(84 * 0.5 * 0.5) ≈ 4.6 wins.
Max's actual number of wins, 66, and his actual match win percentage, 78.6%, are much higher than what we would expect given a 50-50 win rate. It would be pretty reasonable to conclude that Max is an above-average player with Shops against the field, too.
How can we 'science' up the above conclusion?
Science is conducted through the scientific method: you make a hypothesis, conduct an experiment, then accept or reject the hypothesis based on the results. "But Max didn't have a hypothesis." In many cases, data are collected before an actual hypothesis is made. The default position is the null hypothesis: that there is no statistical difference between two groups of data. Put in the context of this experiment, we are essentially taking the position that there is no statistical difference between Max's results with Shops and the theoretical results of an average player with an average deck (average defined as a 50% win rate). That is, Max's results happened purely through chance, and neither skill nor deck selection played a role.
Rejecting the Null Hypothesis (Confidence Intervals)
Max has collected his data - now we have to determine whether Max is good or lucky. And honestly, we cannot know for sure. If you flip a coin 10 times and it comes up heads 10 times, would you conclude this was luck, or that something nefarious was in play? The odds of landing on heads 10 times are theoretically (0.5)^10, or 1/1024. Alternatively, the coin could be weighted so that it almost always comes up heads. Both are possible, right? There is that one-in-a-thousand chance, and weighted coins exist. Granted, in this example the coin would be severely disfigured and it would be readily apparent that it was doctored... Still, if someone says to you "I just flipped 10 coins and had 10 heads" with no additional information, what should you believe?
This brings us to confidence intervals. With any kind of probability there exists a range of theoretical outcomes, and certain outcomes are more likely than others. What we have to determine is our threshold for error, or alternatively our confidence in the results. Luckily, we did most of the work already by calculating the standard deviation. There is a statistical rule called the 68-95-99.7 rule that gives the likelihood of a result falling within one, two, or three standard deviations of the mean. Those ranges are given in the breakdown above. If those 16 games were played by an average Shops pilot, there is a 68% chance that the pilot would win between 6 and 10 games, a 95% chance they would win between 4 and 12 games, and a 99.7% chance they would win between 2 and 14 games. Max won 15 games, so the odds of him being an average Shops pilot, based on this data set, are <0.3%.
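As a cross-check that goes a little beyond the rule of thumb, the exact binomial tail can be computed directly. This sketch is an illustration rather than part of the original analysis, but it agrees with the <0.3% figure:

```python
from math import comb

n, p = 16, 0.5

# Exact probability that a 50% pilot wins 15 or more of 16 matches
tail = sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in (15, 16))
print(f"P(15 or more wins) = {tail:.5f}")  # ~0.00026, comfortably below 0.3%
```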
Dividing the number of wins by the number of games played gives a match win percentage that lets us compare different sample sizes. Doing so shows how win rates can vary dramatically over limited results (and why looking at small sample sizes is unreliable, as @Timewalking suggested). Over 16 matches, we would expect an average Shops pilot to win 37.5-62.5% of their matches 68% of the time, 25-75% of their matches 95% of the time, and 12.5-87.5% of their matches 99.7% of the time. Conversely, the confidence intervals for 84 matches (the number of matches Max played against the field) are much tighter: 44.5-55.5% for one standard deviation, 39.1-60.9% for two standard deviations, and 33.6-66.4% for three. Statistically, more data is always better. Max won 78.6% of these matches, so again, the data strongly suggest that Max is an above-average Shops pilot and/or that Shops is an above-average deck.
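Those shrinking intervals are just the binomial standard deviation expressed as a fraction of the sample size. A short sketch, again assuming only a 50% baseline win rate:

```python
import math

def win_rate_ranges(n, p=0.5):
    """1, 2, and 3 standard deviation win-rate ranges (in %) for n matches."""
    sd_pct = 100 * math.sqrt(n * p * (1 - p)) / n
    return [(100 * p - k * sd_pct, 100 * p + k * sd_pct) for k in (1, 2, 3)]

for n in (16, 84):
    ranges = ", ".join(f"{lo:.1f}-{hi:.1f}%" for lo, hi in win_rate_ranges(n))
    print(f"{n} matches: {ranges}")
# 16 matches: 37.5-62.5%, 25.0-75.0%, 12.5-87.5%
# 84 matches: 44.5-55.5%, 39.1-60.9%, 33.6-66.4%
```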
By convention, those in the medical field (and many other fields) tend to use 95% (2 standard deviations) as the cutoff for a statistically 'true' result. Max is well beyond that, so we can statistically conclude what most of us already concluded: that Max is not an average Shops pilot. We actually have a higher degree of certainty, at least 99.7%, but it's mathematically simpler to stop here for now.
What other meaning can we derive from the data?
There are really two other questions/observations that emerged from the thread concerning Max's article.
- Does Max's higher win rate in Shops mirrors (94% vs. 79%) suggest that Shops is actually a weaker deck against the field?
- Does Max's 81% win rate overall and 79% win rate against the field suggest that Shops is an above average (or good) deck in the metagame?
Let's start with the first, as it is easier to address. The argument assumes that skill in one matchup is transferable to another, that the Shops mirror is inherently a 50-50 matchup, and that, because Max won at a higher rate against Shops than against other decks, the skill-independent MWP of Shops must be below 50% (making it a 'bad' deck). Considering assumptions is really important when interpreting data, but it doesn't change the numbers themselves: Max won 94% of his 16 matches against Shops and 79% of his matches against non-Shops decks. The question is whether or not this discrepancy is real.
Is there a statistical difference between Max's results against Shops and Max's results against the field?
The second way of determining a probability (and by far the most common) is to do so experimentally. We don't know how many matches Max should win once we factor in his skill and his deck selection. How good is Max? How good is Shops? How good is Max with Shops? Again, we don't know for sure, but one thing we can do is have Max play a bunch of matches with Shops to give us an experimental value for his win probability. Well, Max already did that, so let's use his win rate against other decks as a starting point. Max won 66 of 84 matches, for an experimental probability (p-hat) of ~79%. What is the confidence interval for this value? There are several ways to calculate confidence intervals of experimental proportions based on sample size. The easiest to use is the normal approximation interval, or Wald method, where the range is p-hat +/- z * sqrt(p-hat * (1 - p-hat) / n).
The constant z depends on the desired confidence level - for 95%, z is 1.96. Punching the numbers in, we get an experimental probability of 0.79 +/- 0.09 and a range of 70-88%. Max's win rate of 93.5% is outside of this range, implying that there is statistical significance in the discrepancy between the Shops mirrors and matches against the rest of the field.
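If you want to reproduce that interval, here is a minimal sketch of the Wald calculation, with z = 1.96 and Max's 66 wins in 84 matches as the only inputs:

```python
import math

def wald_interval(wins, n, z=1.96):
    """Normal-approximation (Wald) confidence interval for a win probability."""
    p_hat = wins / n
    margin = z * math.sqrt(p_hat * (1 - p_hat) / n)
    return p_hat, margin

p_hat, margin = wald_interval(66, 84)
print(f"p-hat = {p_hat:.3f} +/- {margin:.3f}")  # ~0.786 +/- 0.088
```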
Does a statistically significant result actually tell us what we want to know?
Now it's time to look at our assumptions. We assumed that
- Mirrors are inherently 50-50.
- Skills with a deck are transferable between the mirror and other matchups.
- Skill differences affect outcomes in other matchups to the same degree.
I can poke holes in each of these assumptions. The first is that mirrors are inherently 50-50, but that ignores the fact that 'true' mirrors are relatively rare. Most decks are not 75-card copies of each other, and most classification schemes lump similar decks into the same archetype. For Shops, this includes not just Ravager Shops but also Stax, Rod, and other variants. Ravager Shops tends to destroy these other versions, which is part of its dominance within the metagame: Foundry Inspector breaks the symmetry of Sphere effects and is unaffected by Null Rod, the threat base is wider and lower to the ground (i.e., many creatures that can be cast cheaply), and the mana denial is much more effective against decks with higher mana curves. Max went at least 5-0 against these 'mirrors', which arguably should be considered different decks entirely. If one assumes that the remaining 10-1 record came against other Ravager decks, that gives a win probability of 0.91 +/- 0.17, or a lower limit of 74%. This result is no longer statistically significant.
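Plugging that assumed 10-1 split into the same Wald formula reproduces the numbers above:

```python
import math

# Assumed split: 10-1 against true Ravager mirrors (the remaining 5-0 coming
# against Stax/Rod-style variants), per the paragraph above
wins, n, z = 10, 11, 1.96
p_hat = wins / n
margin = z * math.sqrt(p_hat * (1 - p_hat) / n)
print(f"p-hat = {p_hat:.2f} +/- {margin:.2f}")  # ~0.91 +/- 0.17, lower limit ~0.74
```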
For the second and third assumptions, Max and I both stated that we thought the mirror tests different skills and is very skill-intensive (i.e., that the skill discrepancy with the deck goes a long way toward predicting the winner). The Ravager mirror does have blowout potential, but many games develop into complicated board stalls, with key pieces such as Walking Ballista, Arcbound Ravager, Steel Overseer, and Hangarback Walker shut down by Phyrexian Revokers, and with Metamorphs, Wurmcoils, and Precursor Golems providing powerful threats to navigate. Complex combat math is arguably the most valuable skill in the mirror, with sequencing less important. These scenarios are uncommon in other matchups, where the combat math is much more simplistic because most opposing creatures offer few decision trees (most are vanilla x/x's like tokens, and creatures with abilities tend to be static, like Griselbrand's lifelink, or triggered and predictable, like Inferno Titan). Sequencing is more important for the Shops pilot, who takes the proactive role. The other side of the matchup is also minimally interactive - as Max said, either the opponent kills all your threats or deploys a massive trump like Blightsteel through the Spheres and mana denial, or they don't and die. That is more draw- and die-roll-dependent than skill-based.
I think the statistically significant result in this case points to a couple of possible conclusions. The most likely explanation, I think, is that the Shops mirror tends to be less variable than other Shops matchups. This doesn't require assumptions about the transferability of skill from one matchup to another; in fact, it assumes the opposite of assumption #3, namely that matchups are influenced by skill to varying degrees. Max reached this conclusion as well. I think it is less likely that Shops is weaker than other decks in the field, because reaching that conclusion requires more premises that I find hard to accept.
Does this article indicate that Shops is an overpowered deck in the metagame?
The short answer is "No". That type of question is much better answered by our metagame breakdowns. Again, more data is better, and a much larger sample size mitigates issues of player skill. Applying the same statistical tests to the most recent Vintage Champs gives Shops a win rate of 59% (+/- 5%). In a sample of 404 matches played by 72 players, it's pretty statistically clear that Shops is a good deck. Is it the best deck? Oath is the closest of the other archetypes with a win rate of 55% (+/- 6%). Those confidence intervals overlap, so you can't statistically claim that Shops is the best archetype. The answer, of course, is "more data". When you look at results from the Vintage Challenges and other tournaments taken collectively, ideally they paint a consistent and accurate picture of reality. That's how science works... you do radiometric dating on a bunch of radioactive minerals, and when many different labs reach a consensus of 4.5 billion years, that's what goes in the textbooks. Would people be interested in a large-scale analysis of all available metagame data (in essence, a meta-analysis, the strongest form of scientific evidence in medicine and other areas of science)? I am willing to do this, but I would like confirmation that players would be receptive to the data.
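For completeness, a quick sketch of the Champs comparison using only the figures quoted above (Oath's interval is taken as stated rather than recomputed, since its match count isn't listed here):

```python
import math

# Shops at Vintage Champs: 59% win rate over 404 matches, per the metagame breakdown
p_shops, n_shops, z = 0.59, 404, 1.96
margin = z * math.sqrt(p_shops * (1 - p_shops) / n_shops)
shops = (p_shops - margin, p_shops + margin)
print(f"Shops: {p_shops:.2f} +/- {margin:.2f}")  # ~0.59 +/- 0.05

# Oath's interval is quoted above as 55% +/- 6%; check whether the two ranges overlap
oath = (0.55 - 0.06, 0.55 + 0.06)
print("Intervals overlap:", shops[0] <= oath[1] and oath[0] <= shops[1])  # True
```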
Alright, back to the 100-match set played by one player. We can agree that Max's skill has skewed his results away from those of an average player. The question is what additional effect comes from his deck selection. Again, we have to make assumptions. We don't know Max's 'true' win probability with other decks, but he has stated that he has won roughly 70% of his matches in PTQs. If we accept this figure as accurate, assume that this MWP is transferable to Vintage, and assume that PTQs are comparable in level of competition to Vintage leagues, then we can use 70% as a theoretical probability. In this case, a 70% player would be expected to win 70 of 100 matches, with a standard deviation of sqrt(100 * 0.7 * 0.3) ≈ 4.6 wins and a 95% (two standard deviation) range of roughly 61-79 wins.
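A minimal sketch of that comparison, assuming the ~70% PTQ win rate carries over as the baseline probability:

```python
import math

n, p = 100, 0.70  # 100 matches, assuming Max's stated ~70% PTQ win rate carries over
mean = n * p
std_dev = math.sqrt(n * p * (1 - p))  # ~4.6
low, high = mean - 1.96 * std_dev, mean + 1.96 * std_dev
print(f"95% range for a 70% player: {low:.0f} to {high:.0f} wins")  # ~61 to 79
# Max's 81 total wins sit just above that upper limit.
```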
The confidence interval has an upper limit of 79 wins, and Max won 81, which suggests with 95% certainty that his results are not just a product of variance. If you exclude the Shops matchups, you are at the edge of statistical significance (remember, the confidence interval for that data set was 70-88%). If you exclude only the true Ravager Shops mirrors and keep the Shops variants, you are back above statistical significance, with a confidence interval of 72-88% MWP. Given the proximity to the limits and the assumptions required, I would not personally conclude from this that Shops is an above-average deck in the metagame.
Hopefully this type of data analysis was informative and accurately conveys some of the challenges involved in interpreting data. Questions and comments? Please, let me have them.