Ensemble methods combine a set of predictive models, averaging their weighted outputs into a single prediction. Random Forest, a popular ensemble algorithm, uses a collection of decision trees generated via *bagging*. Being a total noob, I never understood why increasing the number of decision trees reduced the *variance* of a model in a 'mathematical-proof' sense. I came across a paper, Generalized Ambiguity Decomposition for Understanding Ensemble Diversity (2013) by Dr. Kartik Audhkhasi, which mentioned something called *ambiguity decomposition*. Following its citations led me to an old NIPS paper from the 90s, Neural network ensembles, cross validation and active learning by Anders Krogh and Jesper Vedelsby. It turns out there was a more fundamental concept I had been overlooking, which deals with how the estimators in an ensemble vary with respect to one another. I will probably leave the classic bias-variance tradeoff discussion of random forests for another post. I had never heard of quantifying ambiguity, so that's what we are going to talk about.

##### Ambiguity Decomposition

For a given sample input \(x \), our ensemble prediction is represented as \(f(x) = \sum\limits_{i=1}^M {w_i}{f_i}(x) \), with \( M \) total estimators and \( w_i \) indicating the importance of the corresponding estimator's output \( {f_i}(x) \). The weights are all positive and sum to one. The ambiguity \({a_i}(x)\) of each estimator is defined by:

\( a_i(x) = (f_i(x) - f(x))^2 \)

which is simply the squared difference between the prediction of a single model and the ensemble it belongs to for a given \(x\). Hence, we may then define the *ensemble ambiguity* as:

\({a}(x) =\sum\limits_{i=1}^M {w_i}{a_i(x)}\)
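To make this concrete, here is a minimal numpy sketch of the weighted sum above, using hypothetical predictions and weights (the assertion checks that the ensemble ambiguity equals the weighted variance of the predictions about \(f(x)\)):

```python
import numpy as np

# Hypothetical ensemble of M = 3 estimators evaluated at a single input x.
f_i = np.array([2.0, 3.5, 1.0])   # individual predictions f_i(x)
w = np.array([0.5, 0.3, 0.2])     # positive weights summing to one

f = np.dot(w, f_i)                # ensemble prediction f(x)
a_i = (f_i - f) ** 2              # per-estimator ambiguities a_i(x)
a = np.dot(w, a_i)                # ensemble ambiguity a(x)

# a(x) is exactly the weighted variance of the predictions about f(x):
assert np.isclose(a, np.dot(w, f_i ** 2) - f ** 2)
```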

This summation over the ambiguities allows us to see the ensemble ambiguity as the variance of the ensemble around the weighted mean. Let’s define the squared error of a single estimator and its ensemble against the true value \(y(x)\) by the following:

\( e_i(x) = (y(x) - f_i(x))^2 \)

\( e(x) = (y(x) - f(x))^2 \)

Expanding the squares and using \( \sum_i w_i = 1 \) and \( \sum_i w_i f_i(x) = f(x) \), we can rewrite the ensemble ambiguity as:

\( a(x) = \bar{e}(x) - e(x) \), where \( \bar{e}(x) = \sum\limits_{i=1}^M w_i e_i(x) \).

Rearranging, our ensemble error can simply be stated as:

\( e(x) = \bar{e}(x) - a(x) \)
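This per-sample identity is easy to verify numerically. A quick sketch with the same hypothetical ensemble as before, plus an assumed true target \(y(x)\):

```python
import numpy as np

# Hypothetical setup: three estimators, weights summing to one,
# and an assumed true target y(x).
f_i = np.array([2.0, 3.5, 1.0])
w = np.array([0.5, 0.3, 0.2])
y = 2.0

f = np.dot(w, f_i)                # ensemble prediction f(x)
a = np.dot(w, (f_i - f) ** 2)     # ensemble ambiguity a(x)
e_i = (y - f_i) ** 2              # per-estimator errors e_i(x)
e_bar = np.dot(w, e_i)            # weighted average error e-bar(x)
e = (y - f) ** 2                  # ensemble error e(x)

assert np.isclose(e, e_bar - a)   # e(x) = e-bar(x) - a(x)
```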

If we integrate \( a_i(x) \), \( e_i(x) \), and \( e(x) \) over the input distribution \( p(x) \), we get the following expressions respectively:

\( A_{i} = \int_{x} p(x) {a_i}(x)dx\)

\( E_{i} = \int_{x} p(x) {e_i}(x)dx\)

\( E = \int_{x} p(x) e(x)dx\)

Now if we denote the weighted average of the generalization errors of each estimator as \(\bar{E} =\sum\limits_{i=1}^M {w_i}{}E_{i}\) and the weighted average of the ambiguities as \(\bar{A} =\sum\limits_{i=1}^M {w_i}{}A_{i}\), then we can obtain our final expression for the generalization ensemble error as:

\(E = \bar{E} {}-{} \bar{A}\).

This is the ambiguity decomposition. We have beautifully represented the error with two major terms. The first term is the weighted average of the squared errors of all M estimators. The second term captures the weighted 'squared spread' of the estimators about the ensemble; the paper describes it as accounting for "all correlations" between the estimators. Hence, increasing ambiguity (in other words, randomizing each estimator as much as possible in order to maximize spread) helps reduce the error. If the first term dominates the second, the ensemble is said to be strongly biased. This alludes to the bias-variance tradeoff, but shifts the paradigm: a strongly biased ensemble in this sense means our estimators are all quite similar, so there is little to be gained from ensembling them. Also notice that the second term is never negative, since each ambiguity is a squared difference; hence the generalization error of the ensemble is at most the weighted average error of the ensemble's components.
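The full decomposition \(E = \bar{E} - \bar{A}\) can also be checked numerically. A sketch on a toy regression problem, with three crude hypothetical estimators standing in for ensemble members (the integrals over \(p(x)\) are approximated by averaging over the samples):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data standing in for samples drawn from p(x).
x = rng.uniform(-1, 1, size=200)
y = np.sin(3 * x)

# Three hypothetical estimators of differing quality, uniform weights.
preds = np.stack([3 * x, np.sin(3 * x) + 0.3, 2.5 * x - 0.1])
w = np.array([1 / 3, 1 / 3, 1 / 3])

f = w @ preds                              # ensemble predictions f(x)
E_i = np.mean((y - preds) ** 2, axis=1)    # generalization errors E_i
A_i = np.mean((preds - f) ** 2, axis=1)    # integrated ambiguities A_i
E = np.mean((y - f) ** 2)                  # ensemble error E

assert np.isclose(E, w @ E_i - w @ A_i)    # E = E-bar - A-bar
```

Because the ambiguity term is subtracted, the ensemble error comes out no larger than the average error of its members, exactly as the decomposition promises.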

Remember when I said increasing the number of decision trees reduces the *variance* of the random forest model? We want to reduce the variability of the model given a dataset: an overfitted model will change drastically if the data changes even slightly. **So, essentially, we want to maximize the variance of the estimators about their ensemble in order to minimize the variance of our ensemble with respect to its fit against the data.** Sounds a bit tricky, huh?
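A rough illustration of that second kind of variance shrinking: below, each "estimator" is just a noisy mean estimate fit on a bootstrap sample (a hypothetical stand-in for an overfit decision tree), and we measure how much the *ensemble's* prediction wobbles across freshly drawn datasets as the ensemble grows. This is only a sketch of the bagging idea, not how a real random forest is built:

```python
import numpy as np

rng = np.random.default_rng(42)

def ensemble_prediction(data, M):
    """Average M estimators, each fit on a bootstrap resample of data."""
    fits = [rng.choice(data, size=len(data)).mean() for _ in range(M)]
    return np.mean(fits)

# Variance of the ensemble prediction across 500 resampled datasets,
# for a single estimator vs. a bag of 100.
variances = {}
for M in (1, 100):
    preds = [ensemble_prediction(rng.normal(0, 1, 50), M) for _ in range(500)]
    variances[M] = np.var(preds)

# The bagged ensemble's prediction varies less from dataset to dataset.
assert variances[100] < variances[1]
```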

The first paper I mentioned, the 2013 one, introduces a more general theorem called Generalized Ambiguity Decomposition (GAD). Check it out! I might make a separate post on the bias-variance decomposition of ensemble models, where we look at asymptotic behaviors with respect to correlations between estimators, or the size of the ensemble.

Anyhow, I’m starting to get into the habit of reading more papers and finding these simple yet interesting concepts from older papers. I hope this was an insightful post. Feel free to leave a comment 🙂