Model Selection

Lumen Learning; OpenStax

The best model is not always the most complicated. Sometimes including variables that are not evidently important can actually reduce the accuracy of predictions. In this section we discuss model selection strategies, which will help us eliminate variables from the model that are found to be less important.

In practice, the model that includes all available explanatory variables is often referred to as the full model. The full model may not be the best model, and if it isn’t, we want to identify a smaller model that is preferable.

Identifying variables in the model that may not be helpful

Adjusted R² describes the strength of a model fit, and it is a useful tool for evaluating which predictors are adding value to the model, where adding value means they are (likely) improving the accuracy in predicting future outcomes.
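As a rough illustration of why adjusted R² can fall when a predictor is added, the sketch below uses the standard formula R²_adj = 1 − (1 − R²)(n − 1)/(n − k − 1), where n is the number of observations and k the number of predictors. The function name `adjusted_r2` and the R² value of 0.72 are assumptions made only for illustration; n = 141 is inferred from Table 1, where df = n − k − 1 = 136 with k = 4.

```python
def adjusted_r2(r2: float, n: int, k: int) -> float:
    """Adjusted R^2 for a model with k predictors fit to n observations."""
    if n - k - 1 <= 0:
        raise ValueError("need more observations than fitted coefficients")
    return 1.0 - (1.0 - r2) * (n - 1) / (n - k - 1)


# n = 141 is inferred from Table 1: df = n - k - 1 = 136 with k = 4.
# r2 = 0.72 is a made-up value, used only to show the penalty for extra predictors.
n = 141
with_four = adjusted_r2(0.72, n, k=4)
with_three = adjusted_r2(0.72, n, k=3)

# The same unadjusted R^2 is worth less when it took more predictors to reach it.
print(with_four < with_three)
```

A predictor therefore raises adjusted R² only if it increases the unadjusted R² by enough to offset the lost degree of freedom.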

Let’s consider two models, which are shown in Tables 1 and 2. The first table summarizes the full model since it includes all predictors, while the second does not include the duration variable.

Table 1. The fit for the full regression model, including the adjusted R².
	Estimate	Std. Error	t value	Pr(>|t|)
(Intercept)	36.2110	1.5140	23.92	0.0000
cond_new	5.1306	1.0511	4.88	0.0000
stock_photo	1.0803	1.0568	1.02	0.3085
duration	–0.0268	0.1904	–0.14	0.8882
wheels	7.2852	0.5547	13.13	0.0000
R²_adj = 0.7108	df = 136

Table 2. The fit for the regression model for predictors cond_new, stock_photo, and wheels.
	Estimate	Std. Error	t value	Pr(>|t|)
(Intercept)	36.0483	0.9745	36.99	0.0000
cond_new	5.1763	0.9961	5.20	0.0000
stock_photo	1.1177	1.0192	1.10	0.2747
wheels	7.2984	0.5448	13.40	0.0000
R²_adj = 0.7128	df = 137

Example

Which of the two models is better?

Solution:

We compare the adjusted R² of each model to determine which to choose. Since the first model has an R²_adj smaller than the R²_adj of the second model, we prefer the second model to the first.

Will the model without duration be better than the model with duration? We cannot know for sure, but based on the adjusted R², this is our best assessment.

Two model selection strategies

Two common strategies for adding or removing variables in a multiple regression model are called backward elimination and forward selection. These techniques are often referred to as stepwise model selection strategies, because they add or delete one variable at a time as they “step” through the candidate predictors.

Backward elimination starts with the model that includes all potential predictor variables. Variables are eliminated one-at-a-time from the model until we cannot improve the adjusted R². The strategy within each elimination step is to eliminate the variable that leads to the largest improvement in adjusted R².

Example

Results corresponding to the full model for the Mario Kart data are shown in Table 1. How should we proceed under the backward elimination strategy?

Solution:

Our baseline adjusted R² from the full model is R²_adj = 0.7108, and we need to determine whether dropping a predictor will improve the adjusted R². To check, we fit four models that each drop a different predictor, and we record the adjusted R² from each:

[latex]\begin{array}{lllll}\text{Exclude . . .}\hfill&\text{cond_new}\hfill&\text{stock_photo}\hfill&\text{duration}\hfill&\text{wheels}\\\text{ }\hfill&R^2_{adj}=0.6626\hfill&R^2_{adj}=0.7107\hfill&R^2_{adj}=0.7128\hfill&R^2_{adj}=0.3487\end{array}[/latex]

The third model without duration has the highest adjusted R² of 0.7128, so we compare it to the adjusted R² for the full model. Because eliminating duration leads to a model with a higher adjusted R², we drop duration from the model. Since we eliminated a predictor from the model in the first step, we see whether we should eliminate any additional predictors. Our baseline adjusted R² is now R²_adj = 0.7128. We now fit three new models, which consider eliminating each of the three remaining predictors:

[latex]\begin{array}{llll}\text{Exclude duration and . . .}\hfill&\text{cond_new}\hfill&\text{stock_photo}\hfill&\text{wheels}\\\text{ }\hfill&R^2_{adj}=0.6587\hfill&R^2_{adj}=0.7124\hfill&R^2_{adj}=0.3414\end{array}[/latex]

None of these models leads to an improvement in adjusted R², so we do not eliminate any of the remaining predictors. That is, after backward elimination, we are left with the model that keeps cond_new, stock_photo, and wheels, which we can summarize using the coefficients from Table 2:

[latex]\displaystyle\hat{y}=b_0+b_1x_1+b_2x_2+b_4x_4[/latex]

[latex]\displaystyle\widehat{\text{price}}=36.05+5.18\times\text{cond_new}+1.12\times\text{stock_photo}+7.30\times\text{wheels}[/latex]
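The elimination steps in the example above can be sketched in Python. This is a minimal sketch, not a definitive implementation: the names `backward_eliminate` and `ADJ_R2` are hypothetical, and instead of refitting regressions the `score` function simply looks up the adjusted R² values reported in the example. In practice, `score` would fit the model for the given predictors and return its adjusted R².

```python
from typing import Callable, FrozenSet, Iterable

Model = FrozenSet[str]


def backward_eliminate(predictors: Iterable[str],
                       score: Callable[[Model], float]) -> Model:
    """Drop one predictor at a time while adjusted R^2 keeps improving."""
    current = frozenset(predictors)
    best = score(current)
    while current:
        # Try removing each remaining predictor; keep the best candidate model.
        candidates = [current - {p} for p in sorted(current)]
        top = max(candidates, key=score)
        if score(top) <= best:
            break  # no removal improves on the current baseline
        current, best = top, score(top)
    return current


# Adjusted R^2 values copied from the worked example (a stand-in for refitting).
ADJ_R2 = {
    frozenset({"cond_new", "stock_photo", "duration", "wheels"}): 0.7108,
    frozenset({"stock_photo", "duration", "wheels"}): 0.6626,
    frozenset({"cond_new", "duration", "wheels"}): 0.7107,
    frozenset({"cond_new", "stock_photo", "wheels"}): 0.7128,
    frozenset({"cond_new", "stock_photo", "duration"}): 0.3487,
    frozenset({"stock_photo", "wheels"}): 0.6587,
    frozenset({"cond_new", "wheels"}): 0.7124,
    frozenset({"cond_new", "stock_photo"}): 0.3414,
}

final = backward_eliminate(
    ["cond_new", "stock_photo", "duration", "wheels"], ADJ_R2.__getitem__)
print(sorted(final))  # ['cond_new', 'stock_photo', 'wheels']
```

The loop stops at the first step where the best candidate fails to beat the baseline, which is why only the eight models listed in the example need a score.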

The forward selection strategy is the reverse of the backward elimination technique. Instead of eliminating variables one-at-a-time, we add variables one-at-a-time until we cannot find any variables that improve the model (as measured by adjusted R²).

Example

Construct a model for the Mario Kart data set using the forward selection strategy.

Solution:

We start with the model that includes no variables. Then we fit each of the possible models with just one variable. That is, we fit the model including just cond_new, then the model including just stock_photo, then a model with just duration, and a model with just wheels. Each of the four models provides an adjusted R² value:

[latex]\begin{array}{lllll}\text{Add . . .}\hfill&\text{cond_new}\hfill&\text{stock_photo}\hfill&\text{duration}\hfill&\text{wheels}\\\text{ }\hfill&R^2_{adj}=0.3459\hfill&R^2_{adj}=0.0332\hfill&R^2_{adj}=0.1338\hfill&R^2_{adj}=0.6390\end{array}[/latex]

In this first step, we compare the adjusted R² against a baseline model that has no predictors. The no-predictors model always has R²_adj = 0. The model with one predictor that has the largest adjusted R² is the model with the wheels predictor, and because this adjusted R² is larger than the adjusted R² from the model with no predictors (R²_adj = 0), we will add this variable to our model.

We repeat the process, this time considering two-predictor models in which one of the predictors is wheels, with a new baseline of R²_adj = 0.6390:

[latex]\begin{array}{llll}\text{Add wheels and . . .}\hfill&\text{cond_new}\hfill&\text{stock_photo}\hfill&\text{duration}\\\text{ }\hfill&R^2_{adj}=0.7124\hfill&R^2_{adj}=0.6587\hfill&R^2_{adj}=0.6528\end{array}[/latex]

The best predictor in this stage, cond_new, gives a higher adjusted R² (0.7124) than the baseline (0.6390), so we also add cond_new to the model.

Since we have again added a variable to the model, we continue and see whether it would be beneficial to add a third variable:

[latex]\begin{array}{lll}\text{Add wheels, cond_new, and . . .}\hfill&\text{stock_photo}\hfill&\text{duration}\\\text{ }\hfill&R^2_{adj}=0.7128\hfill&R^2_{adj}=0.7107\end{array}[/latex]

The model adding stock_photo improved adjusted R² (0.7124 to 0.7128), so we add stock_photo to the model.

Because we have again added a predictor, we check whether adding the last variable, duration, will improve adjusted R². We compare the adjusted R² for the model with duration and the other three predictors (0.7108) to the model that only considers wheels, cond_new, and stock_photo (0.7128). Adding duration does not improve the adjusted R², so we do not add it to the model, and we have arrived at the same model that we identified from backward elimination.
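The forward selection walk-through above can be sketched the same way. As before, this is only an illustration under stated assumptions: `forward_select` and `ADJ_R2` are hypothetical names, and the `score` function looks up the adjusted R² values reported in the example rather than fitting any regressions.

```python
from typing import Callable, FrozenSet, Iterable, List


def forward_select(predictors: Iterable[str],
                   score: Callable[[FrozenSet[str]], float]) -> List[str]:
    """Add one predictor at a time while adjusted R^2 keeps improving."""
    chosen: List[str] = []
    remaining = sorted(predictors)
    best = score(frozenset())  # baseline: the model with no predictors
    while remaining:
        # Find the single addition that gives the highest adjusted R^2.
        top = max(remaining, key=lambda p: score(frozenset(chosen + [p])))
        top_score = score(frozenset(chosen + [top]))
        if top_score <= best:
            break  # no addition improves on the current baseline
        chosen.append(top)
        remaining.remove(top)
        best = top_score
    return chosen


# Adjusted R^2 values copied from the worked example (a stand-in for refitting).
ADJ_R2 = {
    frozenset(): 0.0,
    frozenset({"cond_new"}): 0.3459,
    frozenset({"stock_photo"}): 0.0332,
    frozenset({"duration"}): 0.1338,
    frozenset({"wheels"}): 0.6390,
    frozenset({"wheels", "cond_new"}): 0.7124,
    frozenset({"wheels", "stock_photo"}): 0.6587,
    frozenset({"wheels", "duration"}): 0.6528,
    frozenset({"wheels", "cond_new", "stock_photo"}): 0.7128,
    frozenset({"wheels", "cond_new", "duration"}): 0.7107,
    frozenset({"wheels", "cond_new", "stock_photo", "duration"}): 0.7108,
}

order = forward_select(
    ["cond_new", "stock_photo", "duration", "wheels"], ADJ_R2.__getitem__)
print(order)  # ['wheels', 'cond_new', 'stock_photo']
```

Returning a list rather than a set preserves the order in which predictors entered the model, which mirrors the three steps of the example.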

Model Selection Strategies

Backward elimination begins with the largest model and eliminates variables one-by-one until we are satisfied that all remaining variables are important to the model. Forward selection starts with no variables included in the model, then it adds in variables according to their importance until no other important variables are found.

There is no guarantee that backward elimination and forward selection will arrive at the same final model. If both techniques are tried and they arrive at different models, we choose the model with the larger R²_adj; other tie-break options exist but are beyond the scope of this book.

The p-Value Approach, an Alternative to Adjusted R²

The p-value may be used as an alternative to adjusted R² for model selection.

In backward elimination, we would identify the predictor corresponding to the largest p-value. If the p-value is above the significance level, usually α = 0.05, then we would drop that variable, refit the model, and repeat the process. If the largest p-value is less than α = 0.05, then we would not eliminate any predictors and the current model would be our best-fitting model.

In forward selection with p-values, we reverse the process. We begin with a model that has no predictors, then we fit a model for each possible predictor, identifying the model where the corresponding predictor’s p-value is smallest. If that p-value is smaller than α = 0.05, we add it to the model and repeat the process, considering whether to add more variables one-at-a-time. When none of the remaining predictors can be added to the model and have a p-value less than 0.05, then we stop adding variables and the current model would be our best-fitting model.

Try It

Examine Table 2, which considers the model including the cond_new, stock_photo, and wheels predictors. If we were using the p-value approach with backward elimination and we were considering this model, which of these three variables would be up for elimination? Would we drop that variable, or would we keep it in the model?

Solution:

The stock_photo predictor is up for elimination since it has the largest p-value (0.2747). Additionally, since that p-value is larger than 0.05, we would in fact eliminate stock_photo from the model.
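A single backward elimination step under the p-value approach can be sketched as follows. The function name `predictor_to_drop` is hypothetical, and the sketch covers one step only: after dropping a variable, the model would have to be refit to obtain new p-values before the step is repeated.

```python
from typing import Dict, Optional


def predictor_to_drop(p_values: Dict[str, float],
                      alpha: float = 0.05) -> Optional[str]:
    """Return the predictor with the largest p-value if it exceeds alpha."""
    if not p_values:
        return None
    worst = max(p_values, key=p_values.get)
    return worst if p_values[worst] > alpha else None


# p-values from Table 2 (the intercept is not a candidate for removal).
table_2 = {"cond_new": 0.0000, "stock_photo": 0.2747, "wheels": 0.0000}
print(predictor_to_drop(table_2))  # stock_photo
```

When the function returns `None`, every remaining predictor has a p-value at or below the significance level and elimination stops.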

While the adjusted R² and p-value approaches are similar, they sometimes lead to different models, with the adjusted R² approach tending to include more predictors in the final model. For example, if we had used the p-value approach with the auction data, we would not have included the stock_photo predictor in the final model.

When to use the adjusted R² and when to use the p-value approach

When the sole goal is to improve prediction accuracy, use adjusted R². This is commonly the case in machine learning applications.

When we care about understanding which variables are statistically significant predictors of the response, or if there is interest in producing a simpler model at the potential cost of a little prediction accuracy, then the p-value approach is preferred.

Regardless of whether you use adjusted R² or the p-value approach, and whether you use backward elimination or forward selection, our job is not done after variable selection. We must still verify that the model conditions are reasonable.

License

Icon for the Creative Commons Attribution 4.0 International License
