Best/Worst Performance Measures

@Smmenen Right.
You can choose X to include whatever bracket you want, so that you can look at the performance of a deck that would get a zero when using X = 8 (the Top 8 bracket).
Edit: Also, if you choose X for each event you want to compare such that you are including the top Y percentile of performing decks in the field, then you could use the scores to compare decks across multiple events. I.e., if you choose X based on N so that the ratio of the two is the same (or as close as possible) for each event, then the scores of each deck should be normalized.
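As a quick sketch of that normalization idea (the field sizes below are arbitrary, just for illustration): pick each event's bracket size X so that X/N matches a common target percentile.

```python
import math

# Sketch: choose each event's bracket size X so that X/N is (nearly) the
# same percentile across events, keeping scores comparable between them.
def bracket_size(field_size, percentile):
    # Smallest X covering at least `percentile` of the field.
    return math.ceil(field_size * percentile)

# Top 10% brackets for three differently sized (illustrative) events:
print([bracket_size(n, 0.10) for n in (157, 80, 43)])  # → [16, 8, 5]
```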

Is there a way to scale the score or value so that it resembles something more intuitive?

The main problem with trying to make a performance measure is the problem of "deck identity." Let's say you put Scornful Egotist in Doomsday. Remove Black Lotus and you've strongly impacted the deck's ability to perform. Remove a Preordain and the resulting impact is so small that you can't measure it in a single tourney. So, how do you define a deck when cards make wildly unequal contributions such that some variations barely matter?

Ideally the performance measures would also take into account the strength of the competitors; i.e., it would jointly solve for an N x N matrix of matchup win probabilities, and a P x N matrix of how well each player can pilot each deck.
Of course this is only tractable if there are many players who play many different decks in the results data.
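A much-simplified sketch of that idea: fit per-deck strengths from head-to-head records with a Bradley-Terry model, ignoring the player dimension entirely. The win counts below are made up for illustration.

```python
# wins[i][j] = matches deck i won against deck j (hypothetical data).
wins = [
    [0, 6, 4],
    [4, 0, 7],
    [6, 3, 0],
]
n = len(wins)
strength = [1.0] * n
for _ in range(200):  # fixed-point iteration (Zermelo's algorithm)
    for i in range(n):
        total_wins = sum(wins[i])
        denom = sum((wins[i][j] + wins[j][i]) / (strength[i] + strength[j])
                    for j in range(n) if j != i)
        strength[i] = total_wins / denom
    s = sum(strength)
    strength = [v / s for v in strength]  # normalize for stability
# Implied probability that deck 0 beats deck 1 in a single match:
p01 = strength[0] / (strength[0] + strength[1])
print([round(v, 3) for v in strength])
```

Solving for the player dimension on top of this needs far more data, which is exactly the tractability concern above.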

@Smmenen said:
Is there a way to scale the score or value so that it resembles something more intuitive?
The formula as I read it is already scaled. You are basically taking the fraction of deck X in the top 8 (A/B) and dividing it by the fraction of deck X in the metagame (X/N). This is a measure of a deck's throughput into the top X, with a ratio of >1 indicating a deck outperformed its slice of the metagame and a ratio of <1 indicating it underperformed its share of the metagame. It's a really nice equation, @AaronPatten, and something I would like to use going forward (if you don't mind), but it's limited in that you need a complete breakdown of the metagame to calculate it.
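In code, the metric is one line (a minimal sketch; the example numbers are a Top 8 archetype with 2 copies in the bracket and 17 in a 157-player field):

```python
# Minimal sketch of the metric described above: a deck's share of the
# top X bracket divided by its share of the full field.
def performance_ratio(top_copies, bracket_size, field_copies, field_size):
    # > 1 means the deck over-performed its metagame share; < 1, under.
    return (top_copies / bracket_size) / (field_copies / field_size)

# Example: 2 copies in a Top 8, out of 17 copies in a 157-player field.
print(round(performance_ratio(2, 8, 17, 157), 3))  # → 2.309
```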

@ChubbyRain said:
@Smmenen said:
Is there a way to scale the score or value so that it resembles something more intuitive?
The formula as I read it is already scaled. You are basically taking the fraction of deck X in the top 8 (A/B) and dividing it by the fraction of deck X in the metagame (X/N). This is a measure of a deck's throughput into the top X, with a ratio of >1 indicating a deck outperformed its slice of the metagame and a ratio of <1 indicating it underperformed its share of the metagame. It's a really nice equation, @AaronPatten, and something I would like to use going forward (if you don't mind), but it's limited in that you need a complete breakdown of the metagame to calculate it.
No, you are right.
Originally, I thought that the upper bound might produce a really odd number, but the upper bound is always going to be 100, and the lower bound 0, with 1 signalling a useful pivot point. So, by "scaled" I meant translating the number into something that can be intuitively useful, but it already is!

@AmbivalentDuck said:
So, how do you define a deck when cards make wildly unequal contributions such that some variations barely matter?
It seems like most people just pick certain key cards and count their inclusion as the definition of the archetype, like calling any deck that plays both Gush and Monastery Mentor a GushMentor deck. There are other ways, but they're a bit more involved. I'm still working through some ideas in that department myself.
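A toy sketch of that key-card convention (the archetype definitions here are illustrative, not canonical):

```python
# Toy sketch of the key-card convention: a list counts as an archetype
# when it contains every signature card (definitions are illustrative).
ARCHETYPES = {
    "Gush Mentor": {"Gush", "Monastery Mentor"},
    "Doomsday": {"Doomsday"},
}

def classify(decklist):
    for name, keys in ARCHETYPES.items():
        if keys <= set(decklist):
            return name
    return "Other"

print(classify(["Gush", "Monastery Mentor", "Preordain"]))  # → Gush Mentor
```

Note the order sensitivity: overlapping definitions resolve to whichever archetype is checked first, which is exactly where the "more involved" methods come in.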
@ChubbyRain said:
@Smmenen said:
Is there a way to scale the score or value so that it resembles something more intuitive?
The formula as I read it is already scaled. You are basically taking the fraction of deck X in the top 8 (A/B) and dividing it by the fraction of deck X in the metagame (X/N). This is a measure of a deck's throughput into the top X, with a ratio of >1 indicating a deck outperformed its slice of the metagame and a ratio of <1 indicating it underperformed its share of the metagame. It's a really nice equation, @AaronPatten, and something I would like to use going forward (if you don't mind), but it's limited in that you need a complete breakdown of the metagame to calculate it.
Exactly right, and I'd be delighted to see it used.
Within a top 8 you could still use it to measure the performance of the finalists or even the top 4, but it is limited, in that respect, by how much data is available.

@Smmenen said:
@ChubbyRain said:
@Smmenen said:
Is there a way to scale the score or value so that it resembles something more intuitive?
The formula as I read it is already scaled. You are basically taking the fraction of deck X in the top 8 (A/B) and dividing it by the fraction of deck X in the metagame (X/N). This is a measure of a deck's throughput into the top X, with a ratio of >1 indicating a deck outperformed its slice of the metagame and a ratio of <1 indicating it underperformed its share of the metagame. It's a really nice equation, @AaronPatten, and something I would like to use going forward (if you don't mind), but it's limited in that you need a complete breakdown of the metagame to calculate it.
No, you are right.
Originally, I thought that the upper bound might produce a really odd number, but the upper bound is always going to be 100, and the lower bound 0, with 1 signalling a useful pivot point. So, by "scaled" I meant translating the number into something that can be intuitively useful, but it already is!
The upper bound for this performance metric will always be:
N/X
So for any given event the values will all have the same upper bound, and if you choose to maintain that ratio throughout your comparisons between different events, they will all be scaled to match. If you decided to compare the top 10% for multiple events, for example, you would consistently have an upper bound of 10. If you chose to compare the top 20% you would consistently have an upper bound of 5. For the top 25% the upper bound would be 4, etc.
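Checking that bound numerically (a quick sketch; the field size is arbitrary): the best case is a deck with exactly X copies in the field that fills the entire top X bracket, which scores (X/X)/(X/N) = N/X.

```python
# Sketch: the metric maxes out at N/X, reached when one deck has exactly
# X copies in the field and fills the entire top X bracket.
def performance_ratio(top, bracket, field, field_size):
    return (top / bracket) / (field / field_size)

N = 200  # arbitrary field size
for pct in (10, 20, 25):
    X = N * pct // 100
    best = performance_ratio(X, X, X, N)  # deck fills the whole bracket
    print(f"top {pct}%: upper bound {best}")  # 10.0, 5.0, 4.0
```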

If I understand correctly, the aspiration here would be to find a community-supported method of objectively measuring performance such that we could mathematically help determine banned and restricted list policy. A non-exhaustive few problems stand out:

22 years after opening my first Revised starter, I still have no idea who or what precisely comprises the "DCI." Since I have no idea how these decisions are actually made, or who performs them, I'm hesitant to believe that even with a "perfect" algorithm, this mysterious decision-making body would substitute the findings for its own procedures and judgment (or lack thereof, as some may criticize).

It's already been established that "fun" is a factor taken into consideration when changing the B&R list, and this seems, on its face, to preclude the use of a strictly mathematical determination.

I agree with the DCI that subjective factors and factors existing outside of tournament results should play a role, as we're talking about an enterprise whose goal is maximum enjoyment by the human beings who participate. Additionally, I believe the secondary market should be an acceptable factor for consideration. Restricting something like Mishra's Workshop, for instance, causes a great deal of tangible harm to community members who are our friends, teammates, TOs, and so forth. That's not something we can or should easily brush aside.

Focusing on the performance of individual cards is likely to yield seemingly mathematically incontrovertible conclusions like "Polluted Delta should be restricted." Since we intuitively know that is "wrong," we'd apply an asterisk to set the result aside. And then we'd do that for Force of Will, Wasteland, Flooded Strand, Tundra, and so forth, and the whole pretense of objectivity would look like a sham.
B


@brianpk80 said:
If I understand correctly, the aspiration here would be to find a community supported method of objectively measuring performance such that we could mathematically help determine banned and restricted list policy.
In brief, that is not the goal of the thread or the OP. Rather, it is an attempt to discuss and search for the best set of metrics (and to identify the limitations of others) for evaluating deck performance. One possible application of such a metric is to inform DCI policy, but that is not the only such application.
Now that we have more metrics than ever before, such a conversation is valuable in its own right, so that we can discuss the advantages and disadvantages of various measuring sticks.
A non-exhaustive few problems stand out:
1. 22 years after opening my first Revised starter, I still have no idea who or what precisely comprises the "DCI." Since I have no idea how these decisions are actually made, or who performs them, I'm hesitant to believe that even with a "perfect" algorithm, this mysterious decision-making body would substitute the findings for its own procedures and judgment (or lack thereof, as some may criticize).
Those are two separate problems by my count. Who or what constitutes the DCI is not the same issue as whether any set of metrics would be used by the DCI. But, in any case, it shouldn't undermine the search for better metrics. By the same token, no one knows exactly what kinds of metrics go into the Federal Reserve's management of monetary policy, but data like the unemployment rate and inflation rates (CPI, etc.) are included. Quantitative information does not mean the objectification of policy making. The policy makers then have to figure out how to weigh all of the data they have.
It's already been established that "fun" is a factor taken into consideration when changing the B&R list, and this seems, on its face, to preclude the use of a strictly mathematical determination.
No doubt subjective information plays a role in DCI decision making. But that should not preclude the search for better metrics (or a discussion on the merits of existing ones) for measuring deck performance.

I agree with the DCI that subjective factors and factors existing outside of tournament results should play a role as we're talking about an enterprise whose goal is maximum enjoyment by the human beings who participate. Additionally, I believe the secondary market should be an acceptable factor for consideration. Restricting something like Mishra's Workshop for instance causes a great deal of tangible harm to community members who are our friends, teammates, TO's, and so forth. That's not something we can or should easily brush aside.

Focusing on the performance of individual cards is likely to yield seemingly mathematically incontrovertible conclusions like "Polluted Delta should be restricted." Since we intuitively know that is "wrong," we'd apply an asterisk to set the result aside. And then we'd do that for Force of Will, Wasteland, Flooded Strand, Tundra, and so forth, and the whole pretense of objectivity would look like a sham.
B
Yeah, I think your points flow from the mistaken assumption that the purpose of this thread is the narrow goal of identifying a perfect measurement that can then solve DCI policy making. No such measure exists. But it is important to have a discussion on the advantages and limitations of existing metrics, and to search for better approaches in the era of big data.

Just as an additional example, I've compiled a more complete breakdown for the top 8 of the NYSE results, in descending order:
Shops: ( 2 * 157 ) / ( 17 * 8 ) = 2.309
Dredge: ( 1 * 157 ) / ( 11 * 8 ) = 1.784
Gush: ( 4 * 157 ) / ( 51 * 8 ) = 1.539
Eldrazi: ( 1 * 157 ) / ( 17 * 8 ) = 1.154
So from this we know that there were 2.31 times more Shops decks per capita in the Top 8 than showed up to the event, 1.78 times more Dredge decks per capita, 1.54 times more Gush decks per capita, and 1.15 times more Eldrazi decks per capita.

@brianpk80 Though Steve's goal is not quite as you described, I completely agree with all of those points. The only reason I agree with 3.5 is that I don't want to see a mass exodus from Vintage like what happened in 2008. I would personally prefer that the secondary market not be a factor in their considerations if it were safe to do so. It seems like having that constraint could go against creating the best Vintage format possible. Your point about Polluted Delta is one I think a lot of people miss in general. It's easy to notice the occasional turn-1 Tinker for Blightsteel Colossus because the result is dramatic, but I suspect that basing restriction purely on performance as gauged by popularity would result in a Vintage format that is very different from what we have today. It's hard to create an objective argument against restricting cards like Polluted Delta or Underground Sea if you base your performance metric on how frequently they show up. I wonder how each of them would score using the formula. The difference between Polluted Delta and Gush, for example, is that the prevalence of Polluted Delta gets ignored as acceptable while the prevalence of Gush does not. This seems like an oversight caused by some psychological aspect of the audience. If I had a complete set of decklists in some kind of easily manipulable spreadsheet for some large events, I might be able to see how each individual card in each event's metagame scores according to the formula. This sounds like a daunting task though, especially if I have to do all of that data entry.
For the namesake cards of each deck it's easy to calculate their score from the metagame breakdown, because the namesake card appears in the deck name, but for other staples such as duals and fetchlands it's not obvious without decklists. Another interesting point about evaluating based on frequency of appearance or metagame presence is that it's plausible that large numbers of people could just be wrong about what they should bring, and that there are so many of them that the likelihood of their success is increased. Metagame presence is definitely an important measure, but I don't think it carries any real weight on its own without a performance metric to show that those decks are also performing at a certain level, rather than just being in great abundance. It's just a measure of what people were betting on, rather than a measure of what performed well in an event.
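The NYSE arithmetic above can be reproduced with a short script (a sketch; the same loop would score individual cards instead of archetypes, given the complete decklists mentioned above):

```python
# Sketch: scoring each NYSE archetype with the formula; the same loop
# would score individual cards instead, given complete decklists.
FIELD_SIZE, BRACKET = 157, 8
counts = {  # archetype: (copies in Top 8, copies in the field)
    "Shops": (2, 17),
    "Dredge": (1, 11),
    "Gush": (4, 51),
    "Eldrazi": (1, 17),
}
for deck, (top, field) in counts.items():
    score = (top / BRACKET) / (field / FIELD_SIZE)
    print(f"{deck}: {score:.3f}")
```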

@Smmenen said:
Yeah, I think your points flow from the mistaken assumption that the purpose of this thread is the narrow goal of identifying a perfect measurement that can then solve DCI policy making. No such measure exists. But it is important to have a discussion on the advantages and limitations of existing metrics, and to search for better approaches in the era of big data.
Oh I know that wasn't the explicit or avowed purpose of discussing which metrics should be used for determining [fill in the blank], but I sense that is the underlying interest that drives these discussions.

@brianpk80 said:
@Smmenen said:
Yeah, I think your points flow from the mistaken assumption that the purpose of this thread is the narrow goal of identifying a perfect measurement that can then solve DCI policy making. No such measure exists. But it is important to have a discussion on the advantages and limitations of existing metrics, and to search for better approaches in the era of big data.
Oh I know that wasn't the explicit or avowed purpose of discussing which metrics should be used for determining [fill in the blank], but I sense that is the underlying interest that drives these discussions.
It's certainly a part of it, but I think there is another major one: the availability of more/different data sources than ever before. It's not just that we have more data sources, we also have more kinds of data: 1) more total metagame data, 2) win percentages, and 3) daily event reports, which aren't commensurate with Top 8 data sets.
Given the multiplicity of data sets, there is a need to discuss and explore what the data means, what our measures tell us, and how we should think about them. I think that's the larger driver here.
FWIW, this isn't a problem limited to Magic. In almost every field, there are new (old) debates emerging over data largely because of the availability of new data sets.

@AaronPatten Something I would like to say about the classification of decks is that while the cards in a deck are important, I think how the deck plays is the best way to classify it. The best example I can think of is that if a Sylvan Mentor deck were playing a single Mana Drain, that doesn't make it a Mana Drain deck. I think that's what you meant as one of the more involved ways, but I still wanted to put that out there.